In a recent article, Oh examined the impact of various key heuristics in competitive SAT solvers. His key finding is that the average success of those heuristics depends on whether the input formula is satisfiable or not. In particular, the effect of the deletion strategy, restart policy, decay factor, and database reduction differs, on average, between satisfiable and unsatisfiable formulas. This observation can be used for designing solvers that specialize in one of the two cases, and for designing a hybrid solver that alternates between SAT and UNSAT ‘modes’. Indeed, certain variants of COMiniSatPS work this way.
We do not see an a priori reason to believe that the SAT/UNSAT divide, which corresponds to the distinction between zero and at least one solution, best explains the differences in the effect of the various heuristics. (While proving Unsat and Sat belong to separate complexity classes, there is no known connection of this fact to the effectiveness of heuristics.) In this work we investigate his findings further, and show empirically that there are more refined measures (properties) than the satisfiability of the formula that better predict the effectiveness of these heuristics. In particular, we checked how it correlates with two measures of satisfiable formulas: the entropy of the formula (to be defined below), which approximates the freedom we have in assigning the variables, and the solution density (henceforth density), which is the number of solutions divided by the size of the search space. Our experiments show that both are strongly correlated with the effectiveness of the heuristics, but the entropy measure seems to be a better predictor. Generally, our findings confirm Oh’s observations regarding which heuristic works better with satisfiable formulas. But we also found that satisfiable formulas with small entropy ‘behave’ similarly to unsatisfiable formulas.
Let $\varphi$ be a propositional CNF formula, $var(\varphi)$ its set of variables and $lit(\varphi)$ its set of literals. In the following we will use $v$ and $\bar{v}$ to denote the literals corresponding to a variable $v$ when the distinction between variables and literals is clear from the context. If $\varphi$ is satisfiable, we denote by $r(l)$, for $l \in lit(\varphi)$, the ratio of solutions to $\varphi$ that satisfy $l$. Hence for all $v \in var(\varphi)$, it holds that $r(v) + r(\bar{v}) = 1$. We now define:
Definition 1 (variable entropy)
For a satisfiable formula $\varphi$, the entropy of a variable $v \in var(\varphi)$ is defined by
$$e(v) = -r(v)\log_2 r(v) - r(\bar{v})\log_2 r(\bar{v}),$$
where $0 \cdot \log_2 0$ is taken as being equal to 0.
Intuitively, entropy reflects how ‘balanced’ a variable is with respect to the solution space of the formula. In particular, $e(v) = 0$ when $r(v) = 0$ or $r(v) = 1$, which means that $\varphi \Rightarrow \bar{v}$ or $\varphi \Rightarrow v$, respectively. In other words, $e(v) = 0$ implies that $v$ is a backbone variable, since its value is implied by the formula. The other extreme is $e(v) = 1$; this happens when $r(v) = r(\bar{v}) = 0.5$, which means that $v$ and $\bar{v}$ appear an equal number of times in the solution space.
Definition 2 (formula entropy)
The entropy of a satisfiable formula is the average entropy of its variables.
As an example, Fig. 1 (right) is a histogram of for a particular formula , where for 24 out of the 100 variables .
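The two definitions can be sketched in a few lines of Python. This is only an illustration: it assumes the variable entropy is the binary Shannon entropy with a base-2 logarithm (so that a perfectly balanced variable has entropy 1), and the function names `variable_entropy` and `formula_entropy` are hypothetical.

```python
import math


def variable_entropy(r_pos: float) -> float:
    """Entropy of a variable, given the fraction of solutions in which it is true."""
    if not 0.0 <= r_pos <= 1.0:
        raise ValueError("the ratio must lie in [0, 1]")
    total = 0.0
    # The ratios of the positive and the negative literal sum to 1.
    for r in (r_pos, 1.0 - r_pos):
        if r > 0.0:  # 0 * log(0) is taken to be 0
            total -= r * math.log2(r)
    return total


def formula_entropy(ratios) -> float:
    """Formula entropy: the average of the variable entropies."""
    return sum(variable_entropy(r) for r in ratios) / len(ratios)


# A backbone variable contributes 0, a perfectly balanced one contributes 1.
print(variable_entropy(1.0))                   # 0.0
print(variable_entropy(0.5))                   # 1.0
print(formula_entropy([1.0, 0.5, 0.0, 0.5]))   # 0.5
```

The input to `formula_entropy` is one ratio per variable; obtaining these ratios is the expensive part.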
2.0.1 Entropy is hard to compute
Let $\#\varphi$ denote the number of solutions a formula $\varphi$ has. Then it is easy to see that
$$r(v) = \frac{\#(\varphi \wedge v)}{\#\varphi}.$$
Hence computing $r(v)$ amounts to two calls to a model counter. But since the denominator is fixed for $\varphi$, computing the entropy of $\varphi$ amounts to $|var(\varphi)| + 1$ calls to a model counter. Since model counting is a #P-complete problem, we can only compute this value for small formulas.
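As a minimal sketch of this counting argument, the following Python code replaces the model counter with brute-force enumeration, which is only feasible for toy formulas. The DIMACS-style clause encoding (lists of non-zero integers) and the names `count_models` and `ratios_and_density` are assumptions made for the illustration; they are not taken from Cachet.

```python
from itertools import product


def count_models(clauses, num_vars):
    """Brute-force model counter over all 2^num_vars assignments."""
    count = 0
    for bits in product((False, True), repeat=num_vars):
        # A literal l is satisfied when variable |l| has the polarity of l.
        if all(any(bits[abs(l) - 1] == (l > 0) for l in clause)
               for clause in clauses):
            count += 1
    return count


def ratios_and_density(clauses, num_vars):
    """Per-variable solution ratios and the solution density of a formula."""
    total = count_models(clauses, num_vars)  # one call: the fixed denominator
    if total == 0:
        raise ValueError("the formula is unsatisfiable")
    # One additional call per variable, with the unit clause [v] added.
    ratios = [count_models(clauses + [[v]], num_vars) / total
              for v in range(1, num_vars + 1)]
    density = total / 2 ** num_vars
    return ratios, density


# (x1 or x2) and (not x1 or x3) has 4 solutions out of 8 assignments.
example = [[1, 2], [-1, 3]]
print(ratios_and_density(example, 3))  # ([0.5, 0.75, 0.75], 0.5)
```

The function makes one counting call for the whole formula and one per variable, which matches the number of calls stated above.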
2.0.2 The benchmark set
Using the model counter Cachet, we computed the precise entropy of 5000 random 3-SAT formulas with 100 variables and 400 clauses. These are formulas taken from SAT-lib, in which the number of backbone variables is known. Specifically, there is an equal number of formulas in this set with 10, 30, 50, 70 and 90 backbone variables (i.e., 1000 formulas for each number of backbone variables), which gave us a near-uniform distribution of entropy among the formulas.
3 A preliminary: standardized linear regression
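We assume the reader is somewhat familiar with linear regression. It is a standard technique for building a linear model $\hat{y} = \beta x + \alpha$, where in our case $\hat{y}$ is a predictor of the number of conflicts, and $x$ is either the entropy or the density of the formula. We will focus on two results of linear regression: the value of $\beta$ and the $p$-value. The latter is computed with respect to a null hypothesis, denoted $H_0$, that $\beta = 0$, and an alternative hypothesis $H_1$. $H_1$ can be either the complement of $H_0$ ($\beta \neq 0$) or a ‘one-sided hypothesis’, e.g., $\beta > 0$. In the former case, $p = \Pr(|Z| > |z|)$, where $Z \sim N(0, 1)$ and $z = (\hat{\beta} - 0)/\hat{\sigma}_{\hat{\beta}}$, with $\hat{\sigma}_{\hat{\beta}}$ the standard error of the estimate $\hat{\beta}$. The ‘0’ in the numerator comes from the specific value in $H_0$. In other words, assuming $H_0$ is correct, $p$ is the probability of observing a standardized slope at least as far from 0 as $z$, the standardized value of $\hat{\beta}$. In the latter case $p = \Pr(Z > z)$.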
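For readers who prefer code, here is a minimal sketch of a least-squares fit and of the two kinds of $p$-value. It assumes a standard normal reference distribution, which is a reasonable approximation for thousands of data points (a Student's t distribution would be the usual choice for small samples). The function names are hypothetical and the toy data is made up.

```python
import math


def linear_regression(xs, ys):
    """Least-squares fit y = beta * x + alpha, plus the standard error of beta."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    # Residual sum of squares; the variance estimate uses n - 2 degrees of freedom.
    rss = sum((y - (beta * x + alpha)) ** 2 for x, y in zip(xs, ys))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    return beta, alpha, std_err


def p_values(beta, std_err, beta_0=0.0):
    """Two-sided and one-sided (alternative: beta > beta_0) p-values.

    Assumption: normal approximation of the standardized slope.
    """
    z = (beta - beta_0) / std_err
    upper_tail = 0.5 * math.erfc(z / math.sqrt(2))   # Pr(Z > z)
    two_sided = math.erfc(abs(z) / math.sqrt(2))     # Pr(|Z| > |z|)
    return two_sided, upper_tail


# Made-up data with a clear upward trend.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 2.9, 5.2, 6.8, 9.1]
beta, alpha, std_err = linear_regression(xs, ys)
print(beta, alpha, p_values(beta, std_err))
```

With a positive estimated slope, the one-sided value for the alternative "slope greater than 0" is half of the two-sided one.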
We list below several important points about the analysis that we applied.
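Standardization of the data: given $n$ data points $x_1, \ldots, x_n$, their standardization is defined for $1 \le i \le n$ by
$$x'_i = \frac{x_i - \bar{x}}{\sigma_x},$$
where $\bar{x}$ is the average value of $x_1, \ldots, x_n$ and
$$\sigma_x = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
is its standard deviation. Now $x'_i$ has no units, and hence two standardized sets of data are comparable even if they originated from different types of measures (in our case, entropy and density). All the data in our experiments was standardized.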
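A minimal sketch of standardization in Python follows. It assumes the population standard deviation (division by the number of points); whether the experiments used this or the sample variant is not stated here, and with 5000 points the difference is negligible. The name `standardize` is hypothetical.

```python
import statistics


def standardize(values):
    """Map values to zero mean and unit standard deviation (unit-free scores)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    if std == 0:
        raise ValueError("cannot standardize constant data")
    return [(v - mean) / std for v in values]


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # mean 5, standard deviation 2
print(standardize(data))   # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]

# Changing the unit of measurement does not change the standardized values
# (up to floating-point rounding).
rescaled = standardize([1000.0 * v for v in data])
print(max(abs(a - b) for a, b in zip(standardize(data), rescaled)))
```

The second print illustrates why standardized series from different measures can be compared directly.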
Bootstrapping: Bootstrapping, parameterized by a value $k$, is a well-known technique for improving the precision of various statistics, such as the confidence interval. Technically, bootstrap is applied as follows: given the original $n$ samples, uniformly sample them $n$ times with replacement (i.e., without taking the sampled points out, which implies that the same point can be selected more than once); repeat this process $k$ times. Hence we now have $n \cdot k$ data points. For our experiments we took a value of $k$ that is rather standard when using this technique, which gives $5000 \cdot k$ data points.
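The resampling step can be sketched as follows. The percentile-based confidence interval is an assumption made for the illustration (it is one common way to turn bootstrap replicates into an interval, not necessarily the one used in the experiments), and the names `bootstrap_resamples` and `percentile_interval` are hypothetical.

```python
import random


def bootstrap_resamples(points, k, seed=0):
    """Return k resamples, each drawn uniformly with replacement from points."""
    rng = random.Random(seed)
    n = len(points)
    # Each resample has the same size as the original data set.
    return [[points[rng.randrange(n)] for _ in range(n)] for _ in range(k)]


def percentile_interval(replicates, confidence=0.95):
    """Percentile confidence interval over the bootstrap replicates of a statistic."""
    ordered = sorted(replicates)
    lo_idx = int((1 - confidence) / 2 * len(ordered))
    hi_idx = max(lo_idx, int((1 + confidence) / 2 * len(ordered)) - 1)
    return ordered[lo_idx], ordered[hi_idx]


# Made-up sample; the statistic here is simply the mean.
sample = [3.0, 5.0, 8.0, 2.0, 7.0, 7.0, 4.0, 6.0, 9.0, 1.0]
means = [sum(r) / len(r) for r in bootstrap_resamples(sample, k=200)]
print(percentile_interval(means))
```

In the setting described above, the statistic computed on each resample would be the regression slope rather than the mean.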
Two regression tests: The entropy and density data consists of pairs of the form $(e, c_i)$ and $(d, c_i)$, respectively, where $e$ and $d$ are the entropy and density of a formula, $c_i$ is the number of conflicts, and $i \in \{1, 2\}$ is the index of the heuristic. Hence the corresponding data is four series of points: $(e, c_1)$, $(e, c_2)$, $(d, c_1)$ and $(d, c_2)$. In order to compare the predictive power of entropy, density and Oh’s criterion of SAT/UNSAT, we performed two statistical tests (recall that the data is standardized, and hence comparable):
The first test: A linear regression test over the series $(e, c_1 - c_2)$, and the series $(d, c_1 - c_2)$.
The second test: A linear regression test over the series $(e, c_1)$ and $(e, c_2)$, and similarly for density (i.e., four tests all together). We then checked the significance of $\beta$ for each of these 4 tests (in all such tests the significance was clear). In addition, we checked the hypothesis that the two values of $\beta$ differ, for each of the measures. The result of this last test is what we will list in the results table in Appendix 0.B.
Intuitively, the two models tell us slightly different things: the first tells us whether the gap between the two heuristics is correlated with the measure, and the second tells us whether there is a significant difference in the value of $\beta$ (the slope of the linear model) between the two heuristics. As we will see in the results, the $p$-values obtained by these models can be very different.
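The two tests can be contrasted with a short sketch. It assumes that every series (the measure, each conflict count, and the gap) is standardized separately before fitting; on raw data the slope of the gap equals the difference of the two slopes by linearity of least squares, so any difference between the tests comes from the standardization and from how significance is assessed. All names and the toy numbers are hypothetical.

```python
import statistics


def standardize(values):
    mean, std = statistics.fmean(values), statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx


def gap_test_slope(measure, conflicts_1, conflicts_2):
    """First test: regress the per-formula gap between the heuristics on the measure."""
    gaps = [a - b for a, b in zip(conflicts_1, conflicts_2)]
    return slope(standardize(measure), standardize(gaps))


def slope_difference(measure, conflicts_1, conflicts_2):
    """Second test: fit each heuristic separately and compare the two slopes."""
    x = standardize(measure)
    return slope(x, standardize(conflicts_1)) - slope(x, standardize(conflicts_2))


# Made-up data: both heuristics get easier with entropy, the first one faster.
entropy = [0.1, 0.3, 0.5, 0.7, 0.9]
conflicts_a = [900.0, 700.0, 520.0, 310.0, 100.0]
conflicts_b = [800.0, 650.0, 500.0, 340.0, 200.0]
print(gap_test_slope(entropy, conflicts_a, conflicts_b))    # close to -1
print(slope_difference(entropy, conflicts_a, conflicts_b))  # close to 0
```

In this toy example the gap is strongly correlated with the measure while the two standardized slopes are almost identical, which illustrates how the two tests can lead to very different conclusions.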
Plots: The plots are based on the original (non-standardized) data. To reduce the clutter (from 5000 points), we rounded all values to 2 decimal places and then aggregated them. Aggregation means that points $(x, y_1), \ldots, (x, y_k)$ (i.e., points with an equal $x$ value) are replaced with a single point at that $x$ value. However, the trend-lines in the various plots are drawn according to the original data, before rounding and aggregation. The statistical significance of these trend-lines appears in Appendix 0.B.
4 Entropy and density predict hardness
We checked the correlation between hardness, as measured by the number of conflicts, and the two measures described above, namely entropy and density. We use the number of conflicts as a proxy for the run-time, because these are all easy formulas for SAT solvers, and hence the differences in run-time are rather meaningless. The two plots in Fig. 2 depict this data based on our experiments with the solver MiniSat-HACK-999ED. It is apparent that higher entropy and higher density imply a smaller number of conflicts. A detailed regression analysis appears in Appendix 0.A, for seven solvers.
We also checked the correlation between the two measures themselves: perhaps formulas with higher entropy also have a higher density (each variable with high entropy nearly doubles the number of solutions). It turns out that in our benchmarks these two measures are only weakly correlated: the confidence interval for $\beta$ is [0.144, 0.156], with a $p$-value which is practically 0.
5 Empirical findings
In this section we describe each of the experiments of Oh, and our own version of the experiment based on entropy and density, when applied to the benchmarks mentioned above. We omit the details of one experiment, in which Oh examined the effect of canceling database reduction, the reason being that this heuristic is only activated after 2000 conflicts, and most of our benchmarks are solved before that point. (Our attempt to use an approximate model counter with larger formulas failed: the inaccuracies were large enough to make the analysis produce senseless results.) Raw data, as well as charts and regression analysis of our full set of experiments, can be found online (see ‘Full experimental results’ in the reference list).
1. Deletion strategy: Different solvers use different criteria for selecting the learned clauses for deletion. It has been shown that for SAT instances learned clauses with a low Literal Block Distance (LBD) value can help, whereas others have no apparent effect. In one of the experiments, whose results are copied here at the top part of Fig. 3, Oh compared the criteria of ‘core LBD-cut’ 5 and clause size 12 (an LBD-cut is the lowest value of LBD a learned clause had so far, assuming this value is recalculated periodically). In other words, either save (i.e., do not delete) clauses with an LBD-cut of 5 and lower, or clauses with size 12 or lower. His results show that for UNSAT instances the former is better, whereas the opposite conclusion is reached for the SAT instances. The results of our own experiments are depicted at the bottom of the figure. They show that the latter is indeed slightly better with our benchmarks (all satisfiable, recall). More importantly, the difference becomes smaller with lower entropy, hence the decline of the trend-line (recall that the trend-lines are based on the raw data, whereas the diagram itself is computed after rounding and aggregation to improve visibility). Hence it is evident that formulas with small entropy ‘behave’ more similarly to UNSAT formulas. The ascending trend-line in the right figure shows, surprisingly, an opposite effect of density.
2. Deletion with different LBD-cut values: Related to the previous heuristic, Oh found that deletion based on larger LBD-cut values, up to a point, improves the performance of the solver with UNSAT formulas, but not with SAT ones. Fig. 4 (top) is an excerpt from his results for various LBD-cut values. We repeated his experiment with LBD-cut 1 and LBD-cut 5. The plots show that lower values of entropy and (independently) lower values of density yield a bigger advantage to LBD-cut 5, which again demonstrates that satisfiable formulas with these values ‘behave’ similarly to UNSAT formulas.
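To make the two retention criteria concrete, here is a small sketch. It uses the standard definition of LBD as the number of distinct decision levels among the literals of a clause; the class `LearnedClause`, the helper names, and the way decision levels are passed in (a dictionary from variable to level) are assumptions for illustration and do not mirror any particular solver's data structures.

```python
def lbd(literals, decision_level):
    """Literal Block Distance: number of distinct decision levels in the clause."""
    return len({decision_level[abs(lit)] for lit in literals})


class LearnedClause:
    def __init__(self, literals, decision_level):
        self.literals = list(literals)
        # The LBD-cut starts at the LBD computed when the clause is learned.
        self.lbd_cut = lbd(self.literals, decision_level)

    def recalculate(self, decision_level):
        """Periodic update: keep the lowest LBD value seen so far."""
        self.lbd_cut = min(self.lbd_cut, lbd(self.literals, decision_level))


def keep_by_lbd_cut(clause, threshold=5):
    """Criterion 'core LBD-cut 5': protect clauses whose LBD-cut is at most 5."""
    return clause.lbd_cut <= threshold


def keep_by_size(clause, threshold=12):
    """Criterion 'clause size 12': protect clauses with at most 12 literals."""
    return len(clause.literals) <= threshold


levels = {1: 3, 2: 3, 3: 5, 4: 7}
clause = LearnedClause([1, -2, 3, -4], levels)
print(clause.lbd_cut)                                  # 3
print(keep_by_lbd_cut(clause), keep_by_size(clause))   # True True

# Later, all four variables happen to be assigned at only two levels.
clause.recalculate({1: 2, 2: 2, 3: 2, 4: 6})
print(clause.lbd_cut)                                  # 2
```

Changing `threshold` in `keep_by_lbd_cut` from 5 to 1 gives the second configuration compared in the LBD-cut experiment.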
3. Restart policy: The Luby restart strategy is based on a fixed sequence of time intervals, whereas the Glucose restarts are more rapid and dynamic. Glucose initiates a restart when the solver identifies that recently learned clauses have a higher LBD than average. According to the competitions’ results this is generally better for UNSAT instances. Oh confirmed the hypothesis that this is related to the restart strategy: indeed his results show that for satisfiable instances Luby restarts are better.
Our own results can be seen in Fig. 5 and in Appendix 0.B. The fact that the gap in the number of conflicts between Luby and Glucose-style restarts is negative implies that the former is generally better, which is consistent with Oh’s results for satisfiable formulas. Observe that the trend-line slightly declines with entropy, which implies that Glucose restarts are slightly better with low entropy. So again we observe that low-entropy formulas ‘behave’ more similarly to UNSAT formulas than those that have high entropy. The table in Appendix 0.B shows that this result has a relatively high $p$-value. We speculate that with high-entropy instances, the solver hits more branches that can be extended to a solution, hence Glucose’s rapid restarts can be detrimental. Density seems to have an opposite effect, although again only with low statistical confidence.
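The contrast between the two restart policies can be sketched as follows. The Luby function generates the well-known sequence 1, 1, 2, 1, 1, 2, 4, ... that scales the restart intervals. The Glucose-style trigger is a simplified illustration only: the window size and the factor `k` are hypothetical defaults chosen for the example, and the class name is an assumption.

```python
from collections import deque


def luby(i):
    """i-th element (1-based) of the Luby sequence 1, 1, 2, 1, 1, 2, 4, ..."""
    k = 1
    while (1 << k) - 1 < i:
        k += 1
    if i == (1 << k) - 1:
        return 1 << (k - 1)
    return luby(i - (1 << (k - 1)) + 1)


class GlucoseStyleRestart:
    """Restart when recent learned clauses have clearly higher LBD than average."""

    def __init__(self, window=50, k=0.8):
        self.recent = deque(maxlen=window)  # LBDs of the most recent learned clauses
        self.total = 0
        self.count = 0
        self.k = k

    def on_conflict(self, lbd_value):
        """Record the LBD of a newly learned clause; return True to restart."""
        self.recent.append(lbd_value)
        self.total += lbd_value
        self.count += 1
        window_full = len(self.recent) == self.recent.maxlen
        recent_avg = sum(self.recent) / len(self.recent)
        if window_full and recent_avg * self.k > self.total / self.count:
            self.recent.clear()
            return True
        return False


print([luby(i) for i in range(1, 16)])
# [1, 1, 2, 1, 1, 2, 4, 1, 1, 2, 1, 1, 2, 4, 8]

policy = GlucoseStyleRestart(window=3)
print([policy.on_conflict(v) for v in [2, 2, 2, 10, 10]])
# [False, False, False, False, True]
```

The Luby intervals are fixed in advance, whereas the second policy reacts to the quality of the clauses being learned, which is what makes it more rapid and dynamic.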
4. The variable decay factor: The well-known VSIDS branching heuristic is based on an activity score of literals, which decays over time, hence giving higher priority to literals that appear in recently learned clauses. In the solver MiniSat_HACK_999ED, there is a different decay factor for each of the two restart phases: this solver alternates between a Glucose-style (G) restart policy phase and a no-restart (NR) phase (these two phases correspond to good heuristics for UNSAT and SAT formulas, respectively). Oh compares different decay factors for each of these restart phases, on top of MiniSat_HACK_999ED. His results show that for UNSAT instances slower decay gives better performance, while for SAT instances it is unclear. His results appear at the top of Fig. 6. We experimented with the two extreme decay factors in that table: 0.95 and 0.6. Note that since our benchmarks are relatively easy, the solver never reaches the NR phase. The plot at the bottom of the figure shows the gap in the number of conflicts between these two values. A higher value means that with strong decay (0.6) the results are worse. We can see that the results are worse with strong decay when the entropy is low, which demonstrates again that the effect of the variable decay factor is similar for UNSAT formulas and satisfiable formulas with low entropy. A similar phenomenon happens with small density.
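The effect of the decay factor can be illustrated with a toy model of activity-based scoring. The sketch below keeps scores per variable and implements decay in the MiniSat style, by growing the bump increment after each conflict instead of shrinking all scores (rescaling to avoid overflow is omitted). The class name `Vsids` and the replayed scenario are hypothetical.

```python
class Vsids:
    """Toy activity scores with exponential decay."""

    def __init__(self, num_vars, decay):
        self.activity = [0.0] * (num_vars + 1)  # index 0 is unused
        self.increment = 1.0
        self.decay = decay

    def bump(self, var):
        self.activity[var] += self.increment

    def after_conflict(self):
        # Dividing the increment by the decay factor is equivalent to
        # multiplying all existing scores by the decay factor.
        self.increment /= self.decay

    def pick(self):
        """Variable with the highest activity."""
        return max(range(1, len(self.activity)), key=self.activity.__getitem__)


def replay(decay):
    """Variable 1 is bumped in three early conflicts, variable 2 once, later."""
    scores = Vsids(num_vars=2, decay=decay)
    for conflict in range(6):
        if conflict < 3:
            scores.bump(1)
        if conflict == 5:
            scores.bump(2)
        scores.after_conflict()
    return scores.pick()


print(replay(0.95))  # 1: with slow decay the older, more frequent variable wins
print(replay(0.6))   # 2: with strong decay the most recent bump dominates
```

A bump that happened `a` conflicts ago is weighted by the decay factor raised to the power `a`, so 0.6 forgets history far more quickly than 0.95.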
We defined the entropy property of satisfiable formulas, and used it, together with solution density, to further investigate the results achieved by Oh. We showed that both are strongly correlated with the difficulty of solving the formula (as measured by the number of conflicts). Furthermore, we showed that they predict the effect of various SAT heuristics better than Oh’s SAT/UNSAT divide, and that satisfiable formulas with small entropy ‘behave’ similarly to unsatisfiable formulas. Since both measures are hard to compute we do not expect these results to be applied directly (e.g., in a portfolio), but perhaps future research will find ways to cheaply approximate them. For example, a high backbone count (variables with a value at decision level 0) may be correlated with low entropy, because such variables contribute 0 to the formula’s entropy.
We thank Dr. David Azriel for his guidance regarding statistical techniques.
-  Full experimental results. http://ie.technion.ac.il/~ofers/entropy/supp.zip.
-  G. Audemard and L. Simon. Predicting learnt clauses quality in modern SAT solvers. In C. Boutilier, editor, IJCAI 2009, Proceedings of the 21st International Joint Conference on Artificial Intelligence, Pasadena, California, USA, July 11-17, 2009, pages 399–404, 2009.
-  M. Luby, A. Sinclair, and D. Zuckerman. Optimal speedup of Las Vegas algorithms. Inf. Process. Lett., 47(4):173–180, 1993.
-  C. Oh. Between SAT and UNSAT: the fundamental difference in CDCL SAT. In SAT, volume 9340 of LNCS, pages 307–323, 2015.
-  T. Sang, F. Bacchus, P. Beame, H. A. Kautz, and T. Pitassi. Combining component caching and clause learning for effective model counting. In SAT, 2004.
-  C. E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27(3):379–423, 1948.
Appendix 0.A Predicting hardness: a regression analysis
Denote by $\beta_e$ and $\beta_d$ the $\beta$-value of the linear models for entropy vs. conflicts and density vs. conflicts, respectively. The table below shows strong correlation between both measures and the number of conflicts (the $p$-value in both cases, for all engines, is practically 0). The last two columns show the gap $\beta_e - \beta_d$ and the corresponding $p$-value for $H_0: \beta_e = \beta_d$, when measured across the iterations of the bootstrap method that was described in Sec. 3. For engines with a high $p$-value we cannot reject $H_0$ with confidence.
|Solver||Entropy: conf. interval||Density: conf. interval||Gap: conf. interval||$p$-val|
|MiniSat-HACK-999ED||(-84.29, -72.58)||(-84.93, -73.56)||(5.37, 16.96)||0.716|
|(modified to Luby)||(-86.31, -75.36)||(-82.97, -72.64)||(-7.51, 1.44)||0.200|
|(modified for 2 phases)||(-72.84, -63.61)||(-72.31, -62.91)||(-4.80, 3.57)||0.738|
|SWDiA5BY||(-91.61, -79.17)||(-90.97, -78.77)||(-5.95, 4.92)||0.84|
|COMiniSatPS||(-74.68, -64.58)||(-75.41, -65.43)||(-3.79, 5.37)||0.76|
|lingeling-ayv||(-76.19, -66.61)||(-71.70, -61.76)||(-8.99, -0.35)||0.029|
|Glucose||(-91.24, -79.34)||(-90.56, -78.88)||(-6.00, 4.85)||0.845|
Appendix 0.B Regression-tests results
The table below lists the confidence interval and corresponding $p$-value for the two regression tests that were explained in Sec. 3 (the second test contributes two pairs of columns), for each of the four experiments described in Sec. 5. One of the listed hypothesis tests is one-sided.
|Exp.||Measure||Conf. interval||$p$-val||Conf. interval||$p$-val||Conf. interval||$p$-val|
|1||Entropy||(-2.76, 2.64)||0.48||(-2.75, 2.46)||0.05||(-12.06, -6.60)||0|
|1||Density||(-0.81, 4.35)||0.09||(-0.83, 4.43)||0.39||(-12.06, -6.64)||0|
|2||Entropy||(-3.72, 0.25)||0.04||(-3.78, 0.25)||0.39||(0.48, 4.61)||0.01|
|2||Density||(-3.40, 0.59)||0.09||(-3.34, 0.69)||0.47||(0.47, 4.56)||0.01|
|3||Entropy||(-8.31, 3.52)||0.22||(-8.01, 3.67)||0.001||(-36.12, -23.78)||0|
|3||Density||(-4.41, 7.36)||0.30||(-4.34, 7.50)||0.05||(-35.99, -23.90)||0|
|4||Entropy||(-15.1, -10.6)||0||(-15.1, -10.7)||0.125||(20.99, 25.44)||0|
|4||Density||(-3.92, 0.60)||0||(-13.60, -8.86)||0.475||(20.96, 25.47)||0|