The problem of deciding satisfiability of a boolean formula is extensively studied in computer science. It appears prominently, as a prototypical NP-complete problem, in the investigations of computational complexity classes. It is studied by the automated theorem proving community. It is also of substantial interest to the AI community due to its applications in several areas including knowledge representation, diagnosis and planning.
Deciding satisfiability of a boolean formula is an NP-complete problem. Thus, it is unlikely that sound and complete algorithms running in polynomial time exist. However, recent years brought several significant advances. First, fast (although, clearly, still exponential in the worst case) implementations of the celebrated Davis-Putnam procedure [DP60] were found. These implementations are able to determine in a matter of seconds the satisfiability of critically constrained CNF formulas with 300 variables and thousands of clauses [DABC96]. Second, several fast randomized algorithms were proposed and thoroughly studied [SLM92, SKC96, SK93, MSG97, Spe96]. These algorithms randomly generate valuations and then apply some local improvement method in an attempt to reach a satisfying assignment. They are often very fast but they provide no guarantee that, given a satisfiable formula, a satisfying assignment will be found. That is, randomized algorithms, while often fast, are not complete. Still, they were shown to be quite effective and solved several practical large-scale satisfiability problems [KS92].
One of the most extensively studied randomized algorithms recently is GSAT [SLM92]. GSAT was shown to outperform the Davis-Putnam procedure on randomly generated 3-CNF formulas from the crossover region [SLM92]. However, GSAT’s performance on structured formulas (encoding coloring and planning problems) was poorer [SKC96, SK93, SKC94]. The basic GSAT algorithm would often become trapped within local minima and never reach a solution. To remedy this, several strategies for escaping from local minima were added to GSAT, yielding its variants: GSAT with averaging, GSAT with clause weighting, and GSAT with random walk strategy (RWS-GSAT), among others [SK93, SKC94]. GSAT with random walk strategy was shown to perform especially well. These studies, while conducted on a wide range of classes of formulas, rarely address the critical issue of the likelihood that GSAT will find a satisfying assignment if one exists, and the running time is studied without reference to this likelihood. Notable exceptions are [Spe96], where RWS-GSAT is compared with a simulated annealing algorithm SASAT, and [MSG97], where RWS-GSAT is compared to a tabu search method.
In this paper, we propose a systematic approach for studying the quality of randomized algorithms. To this end, we introduce the concepts of the accuracy and of the running time relative to the accuracy. The accuracy measures how likely it is that a randomized algorithm finds a satisfying assignment, assuming that the input formula is satisfiable. It is clear that the accuracy of GSAT (and any other similar randomized algorithm) grows as a function of time — the longer we let the algorithm run, the better the chance that it will find a satisfying valuation (if one exists). In this paper, we present experimental results that allow us to quantify this intuition and get insights into the rate of growth of the accuracy.
The notion of the running time of a randomized algorithm has not been rigorously studied. First, in most cases, a randomized algorithm has its running time determined by the choice of parameters that specify the number of random guesses, the number of random steps in a local improvement process, etc. Second, in practical applications, randomized algorithms are often used in an interactive way. The algorithm is allowed to run until it finds a solution or the user decides not to wait any more, stops the execution, modifies the parameters of the algorithm or modifies the problem, and tries again. Finally, since randomized algorithms are not complete, they may make errors by not finding satisfying assignments when such assignments exist. Algorithms that are faster may be less accurate and the trade-off must be taken into consideration [Spe96].
All of this points to the problems that arise when attempting to systematically study the running times of randomized algorithms and extrapolate their asymptotic behavior. In this paper, we define the concept of a running time relative to the accuracy. The relative running time is, intuitively, the time needed by a randomized algorithm to guarantee a postulated accuracy. We show in the paper that the relative running time is a useful performance measure for randomized satisfiability testing algorithms. In particular, we show that the running time of GSAT relative to a prescribed accuracy grows exponentially with the size of the problem.
Related work where the emphasis has been on fine tuning parameter settings [PW96, GW95] has shown somewhat different results in regard to the increase in time as the size of the problems grows. The growth shown by [PW96] is the retrospective variation of maxflips rather than the total number of flips. The random 3-CNF instances reported in [GW95] are restricted in the number of variables. Although our results are also limited by the ability of complete algorithms to determine satisfiable instances, we have results for instances with up to 400 variables in the crossover region. The focus in our work is on maintaining accuracy as the size of the problems increases.
Second, we study the dependence of the accuracy and the relative running time on the number of satisfying assignments that the input formula admits. Intuitively, the more satisfying assignments the input formula has, the better the chance that a randomized algorithm finds one of them, and the shorter the time needed to do so. Again, our results quantify these intuitions. We show that the performance of GSAT increases exponentially with the growth in the number of satisfying assignments.
These results have interesting implications for the problem of constructing sets of test cases for experimenting with satisfiability algorithms. It is now commonly accepted that random 3-CNF formulas from the cross-over region are “difficult” from the point of view of deciding their satisfiability. Consequently, they are good candidates for testing satisfiability algorithms. These claims are based on the studies of the performance of the Davis-Putnam procedure. Indeed, on average, it takes the most time to decide satisfiability of CNF formulas randomly generated from the cross-over region. However, the suitability of formulas generated randomly from the cross-over region for the studies of the performance of randomized algorithms is less clear. Our results indicate that the performance of randomized algorithms critically depends on the number of satisfying assignments and much less on the density of the problem. Both under-constrained and over-constrained problems with a small number of satisfying assignments turn out to be hard for randomized algorithms. At the same time, the Davis-Putnam procedure, while sensitive to the density, is quite robust with respect to the number of satisfying truth assignments.
On the other hand, there are classes of problems that are “easy” for the Davis-Putnam procedure. For instance, the Davis-Putnam procedure is very effective in finding 3-colorings of graphs from special classes such as 2-trees (see Section 5 for definitions). Thus, such problems are not appropriate benchmarks for Davis-Putnam type algorithms. However, a common intuition is that structured problems are “hard” for randomized algorithms [SKC96, SK93, SKC94]. In this paper we study this claim for the formulas that encode the 3- and 4-coloring problems for 2-trees. We show that GSAT’s running time relative to a given accuracy grows exponentially with the size of a graph. This provides formal evidence for the “hardness” claim for this class of problems and implies that, while not useful in the studies of complete algorithms such as the Davis-Putnam method, these formulas are excellent benchmarks for studying the performance of randomized algorithms.
The main contribution of our paper is not as much a discovery of an unexpected behavior of randomized algorithms for testing satisfiability as it is a proposed methodology for studying them. Our concepts of the accuracy and the relative running time allow us to quantify claims that are often accepted on the basis of intuitive arguments but have not been formally pinpointed.
In the paper, we apply our approach to the algorithm RWS-GSAT from [SK93, SKC94]. This algorithm is commonly regarded as one of the best randomized algorithms for satisfiability testing to date. For our experiments we used walksat version 35 downloaded from ftp.research.att.com/dist/ai and run on a SPARC Station 20.
2 Accuracy and running time
In this section, we formally introduce the notion of the accuracy of a randomized algorithm $\mathcal{A}$. We then define the concept of the running time relative to accuracy.
Let $\mathcal{F}$ be a finite set of satisfiable CNF formulas and let $P$ be a probability distribution defined on $\mathcal{F}$. Let $\mathcal{A}$ be a sound algorithm (randomized or not) to test satisfiability. By the accuracy of $\mathcal{A}$ (relative to the probability space $(\mathcal{F}, P)$), we mean the probability that $\mathcal{A}$ finds a satisfying assignment for a formula generated from $\mathcal{F}$ according to the distribution $P$. Clearly, the accuracy of complete algorithms (for all possible spaces of satisfiable formulas) is 1 and, intuitively, the higher the accuracy, the more “complete” the algorithm is for the space $(\mathcal{F}, P)$.
When studying and comparing randomized algorithms that are not complete, accuracy seems to be an important characteristic. It needs to be taken into account in addition to the running time. Clearly, very fast algorithms that often return no satisfying assignments, even if they exist, are not satisfactory. In fact, most of the work on developing better randomized algorithms can be viewed as aimed at increasing the accuracy of these algorithms. Despite this, the accuracy is rarely explicitly mentioned and studied (for exceptions, see [Spe96, MSG97]).
We now propose an approach through which the running times of randomized satisfiability testing algorithms can be compared. We restrict our considerations to the class of randomized algorithms designed according to the following general pattern. These algorithms consist of a series of tries. In each try, a truth assignment is randomly generated. This truth assignment is then subject to a series of local improvement steps aimed at, eventually, reaching a satisfying assignment. The maximum number of tries the algorithm will attempt and the length of each try are the parameters of the algorithm. They are usually specified by the user. We will denote by $MT$ the maximum number of tries and by $MF$ the maximum number of local improvement steps per try. Algorithms designed according to this pattern differ, besides possible differences in the values $MT$ and $MF$, in the specific definition of the local improvement process. The class of algorithms of this structure is quite wide and contains, in particular, the GSAT family of algorithms, as well as algorithms based on the simulated annealing approach.
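The try-and-improve pattern described above can be sketched as follows. This is a minimal illustration, not the code of any particular solver: the clause representation (lists of signed integers), the function names, and the placeholder improvement rule `random_walk_step` are all assumptions made for the example.

```python
import random

def clause_satisfied(clause, assignment):
    # A clause is a list of nonzero integers: v stands for variable v, -v for its negation.
    return any(assignment[abs(lit)] == (lit > 0) for lit in clause)

def unsatisfied_clauses(clauses, assignment):
    return [c for c in clauses if not clause_satisfied(c, assignment)]

def random_walk_step(clauses, assignment, rng):
    # Hypothetical improvement rule: pick a variable from a random unsatisfied clause.
    clause = rng.choice(unsatisfied_clauses(clauses, assignment))
    return abs(rng.choice(clause))

def try_and_improve(clauses, n_vars, max_tries, max_flips, choose_var, rng):
    """Return a satisfying assignment (dict: variable -> bool), or None if none was found."""
    for _ in range(max_tries):
        # Start each try from a fresh random truth assignment.
        assignment = {v: rng.random() < 0.5 for v in range(1, n_vars + 1)}
        for _ in range(max_flips):
            if not unsatisfied_clauses(clauses, assignment):
                return assignment
            var = choose_var(clauses, assignment, rng)
            assignment[var] = not assignment[var]
        # The last flip of the try may have produced a satisfying assignment.
        if not unsatisfied_clauses(clauses, assignment):
            return assignment
    return None
```

Here `max_tries` and `max_flips` play the roles of the two user-specified parameters, and `choose_var` is the only part that changes from one algorithm in the class to another. Returning `None` means only that no satisfying assignment was found within the limits, which is exactly why such algorithms are sound but not complete.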
Let $\mathcal{A}$ be a randomized algorithm falling into the class described above. Clearly, its average running time on instances from the space $(\mathcal{F}, P)$ of satisfiable formulas depends, to a large degree, on the particular choices for the maximum number of tries $MT$ and the maximum number of local improvement steps $MF$. To get an objective measure of the running time, independent of $MT$ and $MF$, when defining time, we require that a postulated accuracy be met. Formally, let $a$, $0 < a \le 1$, be a real number (a postulated accuracy). Define the running time of $\mathcal{A}$ relative to accuracy $a$, $t^a$, to be the minimum time $t$ such that for some positive integers $MT$ and $MF$, the algorithm $\mathcal{A}$ with the maximum of $MT$ tries and with the maximum of $MF$ local improvement steps per try satisfies:
- the average running time on instances from $(\mathcal{F}, P)$ is at most $t$, and
- the accuracy of $\mathcal{A}$ on $(\mathcal{F}, P)$ is at least $a$.
Intuitively, $t^a$ is the minimum expected time that guarantees accuracy $a$. In Section 3, we describe an experimental approach that can be used to estimate the relative running time.
The concepts of accuracy and of running time relative to accuracy open a number of important (and, undoubtedly, very difficult) theoretical problems. However, in this paper we will focus on an experimental study of accuracy and relative running time for a GSAT-type algorithm. These algorithms use the following general pattern for the local improvement process. Given a truth assignment, GSAT selects a variable such that after its truth value is flipped (changed to the opposite one) the number of unsatisfied clauses is minimum. Then, the flip is actually made depending on the result of some additional (often again random) procedure.
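A single local improvement step of this kind might look as follows. This is a sketch under stated assumptions, not the walksat implementation used in the experiments: the mixing probability `p_walk`, the random tie-breaking, and all function names are hypothetical and serve only to illustrate how a greedy choice can be combined with an additional random procedure.

```python
import random

def count_unsatisfied(clauses, assignment):
    # Literal v means "variable v is true"; -v means "variable v is false".
    return sum(
        not any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )

def greedy_flip_candidate(clauses, assignment, rng):
    """Pick a variable whose flip leaves the fewest unsatisfied clauses."""
    best_score, best_vars = None, []
    for var in assignment:
        assignment[var] = not assignment[var]      # tentatively flip
        score = count_unsatisfied(clauses, assignment)
        assignment[var] = not assignment[var]      # undo the tentative flip
        if best_score is None or score < best_score:
            best_score, best_vars = score, [var]
        elif score == best_score:
            best_vars.append(var)
    return rng.choice(best_vars)                   # break ties at random

def walk_or_greedy_step(clauses, assignment, rng, p_walk=0.5):
    """With probability p_walk take a random-walk step, otherwise a greedy one."""
    unsat = [c for c in clauses
             if not any(assignment[abs(l)] == (l > 0) for l in c)]
    if unsat and rng.random() < p_walk:
        # Random walk: flip a variable occurring in a random unsatisfied clause.
        var = abs(rng.choice(rng.choice(unsat)))
    else:
        var = greedy_flip_candidate(clauses, assignment, rng)
    assignment[var] = not assignment[var]
    return var
```

The greedy candidate is computed by trying each flip and undoing it, which is simple but costs one full pass over the clauses per variable; practical implementations maintain incremental counts instead.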
In our experiments, we used two types of data sets. Data sets of the first type consist of randomly generated 3-CNF formulas [MSL92]. Data sets of the second type consist of CNF formulas encoding the $k$-colorability problem for randomly generated 2-trees. These two classes of data sets, as well as the results of the experiments, are described in detail in the following sections.
3 Random 3-CNF formulas
Consider a randomly generated 3-CNF formula $F$, with $N$ variables and the ratio of clauses to variables equal to $L$. Intuitively, when $L$ increases, the probability that $F$ is satisfiable should decrease. It is indeed so [MSL92]. What is more surprising, this probability switches from being close to one to being close to zero very abruptly, within a very small range of values of $L$ called the cross-over region. Implementations of the Davis-Putnam procedure take, on average, the most time on 3-CNF formulas generated (according to a uniform probability distribution) from the cross-over regions. Thus, these formulas are commonly regarded as good test cases for experimental studies of the performance of satisfiability algorithms [CA93, Fre96].
We used seven sets of satisfiable 3-CNF formulas generated from the cross-over regions, one for each $N = 100, 150, \ldots, 400$. Each data set was obtained by randomly generating 3-CNF formulas with $N$ variables and a number of clauses chosen to fall in the cross-over region (for example, $L = 4.3$ for $N = 100$). For each formula, the Davis-Putnam algorithm was then used to decide its satisfiability. The first one thousand satisfiable formulas found in this way were chosen to form the data set.
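The generation step can be illustrated with a short sketch. It assumes the usual fixed-clause-length model, in which each clause contains three distinct variables, each negated with probability 1/2; the text does not spell out these details, and the function name `random_3cnf` is made up for the example. Filtering out unsatisfiable formulas with a complete procedure is not shown.

```python
import random

def random_3cnf(n_vars, n_clauses, rng):
    """Generate n_clauses clauses, each over 3 distinct variables with random signs."""
    formula = []
    for _ in range(n_clauses):
        variables = rng.sample(range(1, n_vars + 1), 3)
        formula.append([v if rng.random() < 0.5 else -v for v in variables])
    return formula

# 100 variables at ratio L = 4.3 gives 430 clauses.
formula = random_3cnf(100, round(4.3 * 100), random.Random(0))
```

Passing an explicit `random.Random` instance keeps data sets reproducible, which matters when the same formulas must be reused across many parameter settings.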
Randomized algorithms are often used on formulas with much larger values of $N$ than those reported in this paper. The importance of accuracy in this study required that we use only satisfiable formulas (otherwise, the accuracy cannot be reliably estimated). This limited the size of randomly generated 3-CNF formulas used in our study since we had to use a complete satisfiability testing procedure to discard those randomly generated formulas that were not satisfiable. In Section 5, we discuss ways in which hard test cases for randomized algorithms can be generated that are not subject to the size limitation.
For each data set, we determined ranges of values for the parameters $MF$ and $MT$ for use with RWS-GSAT, big enough to result in accuracy of at least 0.98. For instance, for the data set with $N = 100$, $MF$ was varied in increments of 100 and $MT$ ranged from 5 to 50 in increments of 5. Next, for each combination of $MT$ and $MF$, we ran RWS-GSAT on all formulas in the data set and tabulated both the running time and the percentage of problems for which a satisfying assignment was found (this quantity was used as an estimate of the accuracy). These estimates and average running times for the data set with $N = 100$ are shown in the tables in Figure 1.
[Figure 1: two tables for RWS-GSAT with N=100, L=4.3, with rows indexed by MT: running time (in seconds) and accuracy.]
Fixing a required accuracy at some level $a$, we then looked for the best time which resulted in this (or higher) accuracy. We used this time as an experimental estimate for the relative running time $t^a$. For instance, for one such level there are 12 entries in the accuracy table with accuracy at or above it. The lowest value from the corresponding entries in the running time table is 0.03 sec., and it is used as the estimate of the relative running time at that level.
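The tabulation and the lookup of the best time can be sketched as below. This is an illustrative outline only: `solver` stands for any function that takes a formula and the two parameters and returns an assignment or `None`, and both function names are assumptions made for this example rather than part of the experimental setup.

```python
import time

def tabulate(solver, formulas, mt_values, mf_values):
    """Return {(mt, mf): (average seconds per formula, fraction solved)}."""
    table = {}
    for mt in mt_values:
        for mf in mf_values:
            solved = 0
            start = time.perf_counter()
            for formula in formulas:
                if solver(formula, mt, mf) is not None:
                    solved += 1
            elapsed = time.perf_counter() - start
            table[(mt, mf)] = (elapsed / len(formulas), solved / len(formulas))
    return table

def relative_running_time(table, a):
    """Smallest average time among parameter settings whose accuracy is at least a."""
    times = [t for (t, acc) in table.values() if acc >= a]
    return min(times) if times else None
```

The fraction solved is a meaningful estimate of accuracy only when every formula in `formulas` is known to be satisfiable. If no parameter setting in the grid reaches the requested accuracy, the function returns `None`, signalling that the ranges of the parameters need to be extended.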
The relative running times for RWS-GSAT run on the data sets with $N = 100, 150, \ldots, 400$, for two prescribed accuracy levels, are shown in Figure 2. Both graphs demonstrate exponential growth, with the running time increasing by a factor of 1.5 to 2 for every 50 additional variables in the input problems. Thus, while GSAT outperforms the Davis-Putnam procedure for instances generated from the critical regions, if we prescribe the accuracy, its running time is still exponential and, thus, GSAT will quickly reach the limits of its applicability. We did not extend our results beyond formulas with up to 400 variables due to the limitations of the Davis-Putnam procedure (or any other complete method to test satisfiability). For problems of this size, GSAT is still extremely effective (it takes only about 2.5 seconds). Data sets used in Section 5 do not have this limitation (we know all formulas in these sets are satisfiable and there is no need to refer to complete satisfiability testing programs). The results presented there also illustrate the exponential growth of the relative running time and are consistent with those discussed here.
4 Number of satisfying assignments
It seems intuitive that accuracy and running time would be dependent on the number of possible satisfying assignments. Studies using randomly generated 3-CNF formulas [CFG96] and 3-CNF formulas generated randomly with parameters allowing the user to control the number of satisfiable solutions for each instance [CI95] show this correlation.
In the same way as for the data sets of Section 3, we constructed a family of data sets, each associated with a prescribed range for the number of satisfying assignments. Each data set consists of 100 satisfiable 3-CNF formulas generated from the cross-over region whose number of satisfying assignments is greater than the lower bound and no greater than the upper bound of the range for that data set. Each data set was formed by randomly generating 3-CNF formulas from the cross-over region and by selecting the first 100 formulas with the number of satisfying assignments falling in the prescribed range (here again, we used the Davis-Putnam procedure).
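The selection criterion can be illustrated on very small formulas with a brute-force counter. This sketch swaps in exhaustive enumeration for the Davis-Putnam procedure used in the experiments, so it is usable only for a handful of variables; the names `count_models` and `in_bucket` are hypothetical.

```python
from itertools import product

def count_models(clauses, n_vars):
    """Count satisfying assignments by exhaustive enumeration (exponential in n_vars)."""
    count = 0
    for values in product([False, True], repeat=n_vars):
        # values[v - 1] is the truth value of variable v.
        if all(any(values[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses):
            count += 1
    return count

def in_bucket(clauses, n_vars, low, high):
    """True if the formula has more than low and at most high satisfying assignments."""
    return low < count_models(clauses, n_vars) <= high
```

For instance, the single clause `[1, 2]` over two variables has three satisfying assignments, so it falls in the range with `low=2` and `high=4`.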
For each data set we ran the RWS-GSAT algorithm with the same fixed values of $MF$ and $MT$, thus allowing the same upper limit on the number of random steps for all data sets (these values resulted in the accuracy of 0.99 in our experiments with the data set discussed earlier). Figure 3 summarizes our findings. It shows that there is a strong relationship between accuracy and the number of possible satisfying assignments. Generally, instances with a small number of solutions are much harder for RWS-GSAT than those with large numbers of solutions. Moreover, this observation is not affected by how constrained the input formulas are. We observed the same general behavior when we repeated the experiment for data sets of 3-CNF formulas generated from the under-constrained region (100 variables, 410 clauses) and the over-constrained region (100 variables, 450 clauses), with under-constrained instances with few solutions being the hardest.
These results indicate that, when generating data sets for experimental studies of randomized algorithms, it is more important to ensure that they have few solutions rather than that they come from the critically constrained region.
5 CNF formulas encoding $k$-colorability
To expand the scope of applicability of our results and argue their robustness, we also used in our study data sets consisting of CNF formulas encoding the $k$-colorability problem for graphs. While easy for the Davis-Putnam procedure (which resolves their satisfiability in polynomial time), formulas of this type are believed to be “hard” for randomized algorithms and were used in the past in the experimental studies of their performance. In particular, it was reported in [SK93] that RWS-GSAT does not perform well on such inputs (see also [JAMS91]).
Given a graph $G$ with the vertex set $V$ and the edge set $E$, we construct the CNF formula $COL(G,k)$ as follows. First, we introduce new propositional variables $col(v,i)$, $v \in V$ and $i = 1, \ldots, k$. The variable $col(v,i)$ expresses the fact that the vertex $v$ is colored with the color $i$. Now, we define $COL(G,k)$ to consist of the following clauses:
- $\neg col(x,i) \vee \neg col(y,i)$, for every edge $\{x,y\}$ from $E$ and every $i$, $1 \le i \le k$,
- $col(x,1) \vee \ldots \vee col(x,k)$, for every vertex $x$ of $G$,
- $\neg col(x,i) \vee \neg col(x,j)$, for every vertex $x$ of $G$ and for every $i$, $j$, $1 \le i < j \le k$.
It is easy to see that there is a one-to-one correspondence between $k$-colorings of $G$ and satisfying assignments for $COL(G,k)$. To generate formulas for experimenting with RWS-GSAT (and other satisfiability testing procedures) it is, then, enough to generate graphs $G$ and produce formulas $COL(G,k)$.
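The three groups of clauses can be produced mechanically. The sketch below assumes vertices numbered from 1, colors numbered from 1 to `k`, and a particular numbering of the propositional variables; these conventions and the function name `coloring_cnf` are choices made for the example, not part of the encoding itself.

```python
def coloring_cnf(n_vertices, edges, k):
    """Clauses (lists of signed ints) whose models are the k-colorings of the graph.

    Vertices are 1..n_vertices and colors are 1..k; the variable for
    "vertex v has color i" is numbered (v - 1) * k + i.
    """
    def var(v, i):
        return (v - 1) * k + i

    clauses = []
    for x, y in edges:
        for i in range(1, k + 1):
            clauses.append([-var(x, i), -var(y, i)])   # endpoints differ in color
    for v in range(1, n_vertices + 1):
        clauses.append([var(v, i) for i in range(1, k + 1)])   # at least one color
        for i in range(1, k + 1):
            for j in range(i + 1, k + 1):
                clauses.append([-var(v, i), -var(v, j)])       # at most one color
    return clauses

triangle = coloring_cnf(3, [(1, 2), (1, 3), (2, 3)], 3)
```

For the triangle with three colors this yields 9 edge clauses, 3 "at least one color" clauses and 9 "at most one color" clauses, 21 in total.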
In our experiments, we used formulas that encode $k$-colorings for graphs known as 2-trees. The class of 2-trees is defined inductively as follows:
- A complete graph on three vertices (a “triangle”) is a 2-tree.
- If $T$ is a 2-tree, then a graph obtained by selecting an edge $\{x,y\}$ in $T$, adding to $T$ a new vertex $z$ and joining $z$ to $x$ and $y$ is also a 2-tree.
A 2-tree with 6 vertices is shown in Fig. 4. The vertices of the original triangle are labeled 1, 2 and 3. The remaining vertices are labeled according to the order they were added.
The concept of 2-trees can be generalized to $k$-trees, for an arbitrary $k$. Graphs in these classes are important. They have bounded tree-width and, consequently, many NP-complete problems can be solved for them in polynomial time [AP89].
We can generate 2-trees randomly by simulating the definition given above and by selecting an edge for “expansion” randomly in the current 2-tree $T$. We generated in this way families of graphs, one for each number of vertices $n$ considered, each consisting of one hundred randomly generated 2-trees with $n$ vertices. Then, we created the corresponding sets of CNF formulas encoding 3-colorability of these 2-trees. Each formula in such a set has exactly 6 satisfying assignments (since each 2-tree has exactly 6 different 3-colorings). Thus, they are appropriate for testing the accuracy of RWS-GSAT.
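A generator following the inductive definition might look like this. The sketch assumes that the edge to expand is chosen uniformly among all current edges, which the text does not specify, and the function name `random_2tree` is invented for the example.

```python
import random

def random_2tree(n_vertices, rng):
    """Return the edge list of a random 2-tree on vertices 1..n_vertices (n_vertices >= 3)."""
    edges = [(1, 2), (1, 3), (2, 3)]          # the initial triangle
    for z in range(4, n_vertices + 1):
        x, y = rng.choice(edges)              # edge selected for "expansion"
        edges.append((x, z))                  # join the new vertex z to both endpoints
        edges.append((y, z))
    return edges
```

Each added vertex contributes two edges, so a 2-tree with `n` vertices built this way has `2n - 3` edges, and vertices are labeled in the order in which they were added.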
Using CNF formulas of this type has an important benefit. Data sets can be prepared without the need to use complete (but very inefficient for large inputs) satisfiability testing procedures. By appropriately choosing the underlying graphs, we can guarantee the satisfiability of the resulting formulas and, often, we also have some control over the number of solutions (for instance, in the case of 3-colorability of 2-trees there are exactly 6 solutions).
We used the same methodology as the one described in Section 3 to tabulate the accuracy and the running time of RWS-GSAT for a large range of choices for the parameters $MT$ and $MF$. Based on these tables, as before, we computed estimates for the relative running times at the prescribed accuracy levels, for each of the data sets. The results that present the running time as a function of the number of vertices in a graph (which is of the same order as the number of variables in the corresponding CNF formula) are gathered in Figure 5. They show that RWS-GSAT’s performance deteriorates exponentially (the running time grows by a fixed factor for every 50 additional vertices).
An important question is: how to approach constraint satisfaction problems if they seem to be beyond the scope of applicability of randomized algorithms? A common approach is to relax some constraints. It often works because the resulting constraint sets (theories) are “easier” to satisfy (admit more satisfying assignments). We have already discussed the issue of the number of solutions in the previous section. Now, we will illustrate the effect of increasing the number of solutions (relaxing the constraints) in the case of the colorability problem. To this end, we will consider formulas representing 4-colorability of 2-trees. These formulas have exponentially many satisfying truth assignments (a 2-tree with $n$ vertices has exactly $3 \cdot 2^n$ 4-colorings). For these formulas we also tabulated the relative running times, at the prescribed accuracy levels, as a function of the number of vertices in the graph. The results are shown in Figure 6.
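The number of colorings of a 2-tree follows from a short induction on its inductive definition. In the derivation below, $c_k(n)$ is notation introduced only for this remark and denotes the number of $k$-colorings of a 2-tree with $n$ vertices. The triangle can be colored in $k(k-1)(k-2)$ ways, and each added vertex is joined to two adjacent (hence differently colored) vertices, which leaves it $k-2$ choices.

```latex
\begin{align*}
c_k(3) &= k(k-1)(k-2), \\
c_k(n+1) &= (k-2)\, c_k(n), \\
c_k(n) &= k(k-1)(k-2)^{n-2}, \\
c_3(n) &= 6, \qquad c_4(n) = 4 \cdot 3 \cdot 2^{n-2} = 3 \cdot 2^{n}.
\end{align*}
```

In particular, the count does not depend on which edges were chosen for expansion, so every 2-tree with the same number of vertices has the same number of $k$-colorings.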
Thus, despite the fact that a formula encoding 4-colorability of a 2-tree is larger than the formula encoding 3-colorability of the same 2-tree, RWS-GSAT’s running times are much lower. In particular, within 0.5 seconds RWS-GSAT can find a 4-coloring of randomly generated 2-trees with 500 vertices. As demonstrated by Figure 5, RWS-GSAT would require thousands of seconds for 2-trees of this size to guarantee the same accuracy when finding 3-colorings. Thus, even a rather modest relaxation of constraints can increase the number of satisfying assignments substantially enough to lead to noticeable speed-ups. On the other hand, even though “easier”, the theories encoding the 4-colorability problem for 2-trees are still hard to solve by GSAT, as the rate of growth of the relative running time is exponential (Fig. 6).
The results of this section further confirm, and provide quantitative insights into, our earlier claims about the exponential behavior of the relative running time for GSAT and about the dependence of the relative running time on the number of solutions. They also show that selecting a class of graphs (we selected the class of 2-trees here but there are, clearly, many other possibilities) and a graph problem (we focused on colorability but there are many other problems such as hamiltonicity, existence of vertex covers, cliques, etc.), and then encoding this problem for graphs from the selected class, yields a family of formulas that can be used in testing satisfiability algorithms. The main benefit of the approach is that by selecting a suitable class of graphs, we can guarantee satisfiability of the resulting formulas and can control the number of solutions, thus eliminating the need to resort to complete satisfiability procedures when preparing the test cases. We intend to further pursue this direction.
In the paper we formally stated the definitions of the accuracy of a randomized algorithm and of its running time relative to a prescribed accuracy. We showed that these notions enable objective studies and comparisons of the performance and quality of randomized algorithms. We applied our approach to study the RWS-GSAT algorithm. We showed that, given a prescribed accuracy, the running time of RWS-GSAT was exponential in the number of variables for several classes of randomly generated CNF formulas. We also showed that the accuracy (and, consequently, the running time relative to the accuracy) strongly depended on the number of satisfying assignments: the bigger this number, the easier was the problem for RWS-GSAT. This observation is independent of the “density” of the input formula. The results suggest that satisfiable CNF formulas with few satisfying assignments are hard for RWS-GSAT and should be used for comparisons and benchmarking. One such class of formulas, CNF encodings of the 3-colorability problem for 2-trees, was described in the paper and used in our study of RWS-GSAT.
The exponential behavior of RWS-GSAT points to the limitations of randomized algorithms. However, our results indicating that input formulas with more solutions are “easier” for RWS-GSAT to deal with explain RWS-GSAT’s success in solving some large practical problems: such problems can be made “easy” for RWS-GSAT by relaxing some of the constraints.
- [AP89] S. Arnborg and A. Proskurowski. Linear time algorithms for NP-hard problems restricted to partial k-trees. Discrete Appl. Math., 23:11–24, 1989.
- [CA93] James M. Crawford and Larry D. Auton. Experimental results on the crossover point in satisfiability problems. In AAAI-93, 1993.
- [CFG96] Dave Clark, Jeremy Frank, Ian Gent, Ewan MacIntyre, Neven Tomov, and Toby Walsh. Local search and the number of solutions. In Proceedings of CP-96, 1996.
- [CI95] Byungki Cha and Kazuo Iwama. Performance test of local search algorithms using new types of random CNF formulas. In Proceedings of IJCAI, 1995.
- [DABC96] O. Dubois, P. Andre, Y. Boufkhad, and J. Carlier. SAT versus UNSAT. DIMACS Cliques, Coloring and Satisfiability, 26, 1996.
- [DP60] M. Davis and H. Putnam. A computing procedure for quantification theory. Journal of the Association for Computing Machinery, 7, 1960.
- [Fre96] Jon W. Freeman. Hard random 3-SAT problems and the Davis-Putnam procedure. Artificial Intelligence, 81, 1996.
- [GW95] Ian P. Gent and Toby Walsh. Unsatisfied variables in local search. In Proceedings of AISB-95, 1995.
- [JAMS91] David Johnson, Cecilia Aragon, Lyle McGeoch, and Catherine Schevon. Optimization by simulated annealing: An experimental evaluation; part ii, graph coloring and number partitioning. Operations Research, 39(3), May-June 1991.
- [KS92] Henry A. Kautz and Bart Selman. Planning as satisfiability. In Proceedings of the 10th European Conference on Artificial Intelligence, Vienna, Austria, 1992.
- [MSG97] Bertrand Mazure, Lakhdar Saís, and Éric Grégoire. Tabu search for SAT. In Proceedings of the Fourteenth National Conference on Artificial Intelligence (AAAI-97). MIT Press, July 1997.
- [MSL92] David Mitchell, Bart Selman, and Hector Levesque. Hard and easy distributions of SAT problems. In Proceedings of the Tenth National Conference on Artificial Intelligence (AAAI-92), July 1992.
- [PW96] Andrew J. Parkes and Joachim P. Walser. Tuning local search for satisfiability testing. In Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI-96), pages 356–362, 1996.
- [SK93] Bart Selman and Henry A. Kautz. Domain-independent extensions to GSAT: Solving large structured satisfiability problems. In Proceedings of IJCAI-93, 1993.
- [SKC94] Bart Selman, Henry A. Kautz, and Bram Cohen. Noise strategies for improving local search. In Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI-94), 1994.
- [SKC96] Bart Selman, Henry A. Kautz, and Bram Cohen. Local search strategies for satisfiability. DIMACS Cliques, Coloring and Satisfiability, 26, 1996.
- [SLM92] Bart Selman, Hector Levesque, and David Mitchell. A new method for solving hard satisfiability problems. In Proceedings of the Tenth National Conference on Artificial Intelligence (AAAI-92), July 1992.
- [Spe96] William M. Spears. Simulated annealing for hard satisfiability problems. DIMACS Cliques, Coloring and Satisfiability, 26, 1996.