Scavenger 0.1: A Theorem Prover Based on Conflict Resolution

04/11/2017 ∙ by Daniyar Itegulov, et al. ∙ Australian National University

This paper introduces Scavenger, the first theorem prover for pure first-order logic without equality based on the new conflict resolution calculus. Conflict resolution has a restricted resolution inference rule that resembles (a first-order generalization of) unit propagation as well as a rule for assuming decision literals and a rule for deriving new clauses by (a first-order generalization of) conflict-driven clause learning.




1 Introduction

The outstanding efficiency of current propositional SAT solvers naturally raises the question of whether similar ideas could be employed to automate first-order logical reasoning. The recent Conflict Resolution calculus [ConflictResolution] (not to be confused with the homonymous calculus for linear rational inequalities [KorovinTVConflictResolution]) can be regarded as a crucial initial step towards answering this question. From a proof-theoretical perspective, Conflict Resolution generalizes (to first-order logic) the two main mechanisms on which modern SAT solvers are based: unit propagation and conflict-driven clause learning. The calculus is sound and refutationally complete, and its derivations are isomorphic to implication graphs.

This paper goes one step further by defining proof search algorithms for Conflict Resolution. Familiarity with the propositional CDCL procedure [CDCL] is assumed, even though it is briefly sketched in Section 2. The main challenge in lifting this procedure to first-order logic is that, unlike in propositional logic, first-order unit propagation does not always terminate and true clauses do not necessarily have uniformly true literals (cf. Section 4). Our solutions to these challenges are discussed in Section 5 and Section 6, and experimental results are presented in Section 7.

Related Work:

The unit-propagating resolution rule of Conflict Resolution can be traced back to unit-resulting resolution [URR]. Other attempts to lift DPLL [DavisPutnam, DLL] or CDCL [CDCL] to first-order logic include Model Evolution [BaumgartnerFODPLL, BaumgartnerTinelli, Baumgartner, BaumgartnerTinelliLemmaLearning], Geometric Resolution [GeometricResolution], Non-Redundant Clause Learning [AlagiWeidenbach] and the Semantically-Guided Goal Sensitive procedure [BonacinaPlaisted1, BonacinaPlaisted2, BonacinaPlaisted3, BonacinaPlaisted4]. A brief summary of these approaches and a comparison with Conflict Resolution can be found in [ConflictResolution]. Furthermore, many architectures [Equinox, iProver, InstGen, AVATAR, Satallax] for first-order and higher-order theorem proving use a SAT solver as a black box for propositional reasoning, without attempting to lift it; and Semantic Resolution [SCOtt1, SCOtt2] is yet another related approach that uses externally built first-order models to guide resolution.

2 Propositional CDCL

During search in the propositional case, a SAT solver keeps a model (a.k.a. trail) consisting of a (conjunctive) list of decision literals and propagated literals. Literals of unit clauses are automatically added to the trail, and whenever a clause has only one literal that is not falsified by the current model, this literal is added to the model (thereby satisfying that clause). This process is known as unit propagation. If unit propagation reaches a conflict (i.e. a situation where the dual of a literal already contained in the model would have to be added to it), the SAT solver backtracks, removing from the model the decision literals responsible for the conflict (as well as the propagated literals entailed by the removed decision literals) and deriving, or learning, a conflict-driven clause consisting of duals of the decision literals responsible for the conflict (or the empty clause, if there were no decision literals). (In practice, optimizations such as 1UIP are used, and more sophisticated clauses, which are not just disjunctions of duals of the decision literals involved in the conflict, can be derived. But these optimizations are inessential to the focus of this paper.) If unit propagation terminates without reaching a conflict and all clauses are satisfied by the model, then the input clause set is satisfiable. If some clauses are still not satisfied, the SAT solver chooses and assigns another decision literal, adding it to the trail, and satisfying the clauses that contain it.
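The propagation step described above can be illustrated by the following minimal Python sketch. It is not taken from any particular SAT solver: the DIMACS-style integer encoding of literals and the function name `unit_propagate` are assumptions made for illustration, and no watched-literal optimization is attempted.

```python
from typing import List, Optional, Tuple

# Assumed encoding (DIMACS style): 3 stands for x3 and -3 for its dual.
Clause = List[int]


def unit_propagate(clauses: List[Clause],
                   trail: List[int]) -> Tuple[List[int], Optional[Clause]]:
    """Extend the trail by unit propagation.

    Returns the extended trail and either None or a clause whose literals
    are all falsified by the trail (a conflict).
    """
    trail = list(trail)
    assigned = set(trail)
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            if any(lit in assigned for lit in clause):
                continue  # the clause is already satisfied by the trail
            # Literals of the clause that are not falsified by the trail.
            open_lits = [lit for lit in clause if -lit not in assigned]
            if not open_lits:
                return trail, clause  # every literal is falsified
            if len(open_lits) == 1:
                trail.append(open_lits[0])  # propagate the only open literal
                assigned.add(open_lits[0])
                changed = True
    return trail, None
```

For example, on the clauses `[[1], [-1, 2], [-1, -2]]` with an empty trail, the sketch propagates 1 and then 2, after which the third clause is reported as the conflict.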

3 Conflict Resolution

The inference rules of the Conflict Resolution calculus are shown in Figure 1. The unit-propagating resolution rule is a chain of restricted resolutions with unit clauses as left premises and a unit clause as final conclusion. Decision literals are denoted by square brackets, and the conflict-driven clause learning rule infers a new clause consisting of negations of instances of decision literals used to reach a conflict (a.k.a. the empty clause $\bot$). A clause learning inference is said to discharge the decision literals that it uses. As in the resolution calculus, derivations are directed acyclic graphs that are not necessarily tree-like. A refutation is a derivation of $\bot$ with no undischarged decision literals.

From a natural deduction point of view, a unit-propagating resolution rule can be regarded as a chain of implication eliminations taking unification into account, whereas decision literals and conflict-driven clause learning are reminiscent of, respectively, assumptions and chains of negation introductions, also generalized to first-order logic through unification. Therefore, Conflict Resolution can be considered a first-order hybrid of resolution and natural deduction.

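Unit-Propagating Resolution:

$$\frac{\ell_1 \quad \ldots \quad \ell_n \qquad \overline{\ell'_1} \vee \ldots \vee \overline{\ell'_n} \vee \ell}{\ell\sigma}\ \mathbf{u}(\sigma)$$

where $\sigma$ is a unifier of $\ell_k$ and $\ell'_k$, for all $k \in \{1, \ldots, n\}$.

Conflict:

$$\frac{\ell \qquad \overline{\ell'}}{\bot}\ \mathbf{c}(\sigma)$$

where $\sigma$ is a unifier of $\ell$ and $\ell'$.

Conflict-Driven Clause Learning:

$$\frac{\begin{array}{c} [\ell_1]^i \quad \ldots \quad [\ell_n]^i \\ \vdots \\ \bot \end{array}}{(\overline{\ell_1}\sigma^1_1 \vee \ldots \vee \overline{\ell_1}\sigma^1_{m_1}) \vee \ldots \vee (\overline{\ell_n}\sigma^n_1 \vee \ldots \vee \overline{\ell_n}\sigma^n_{m_n})}\ \mathbf{cl}^i$$

where $\sigma^k_j$ (for $1 \leq k \leq n$ and $1 \leq j \leq m_k$) is the composition of all substitutions used on the $j$-th path from $\ell_k$ to $\bot$. (Since a proof DAG is not necessarily tree-like, there may be more than one path connecting $\ell_k$ to $\bot$ in the DAG-like proof.)

Figure 1: The Conflict Resolution Calculus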

4 Lifting Challenges

First-order logic presents many new challenges for methods based on propagation and decisions, of which the following can be singled out:

(1) non-termination of unit-propagation:

In first-order logic, unit propagation may never terminate. For example, the clause set $\{p(a),\ \neg p(X) \vee p(f(X)),\ q \vee r,\ \neg q \vee r,\ q \vee \neg r,\ \neg q \vee \neg r\}$ is clearly unsatisfiable, because there is no assignment of $q$ and $r$ to true or false that would satisfy all the last four clauses. However, unit propagation would derive the following infinite sequence of units, by successively resolving $\neg p(X) \vee p(f(X))$ with previously derived units, starting with $p(a)$: $p(f(a)),\ p(f(f(a))),\ p(f(f(f(a)))),\ \ldots$ Consequently, a proof search strategy that would wait for unit propagation to terminate before making decisions would never be able to conclude that the given clause set is unsatisfiable.

(2) absence of uniformly true literals in satisfied clauses:

While in the propositional case, a clause that is true in a model always has at least one literal that is true in that model, this is not so in first-order logic, because shared variables create dependencies between literals. For instance, the clause set $\{p(X) \vee q(X),\ \neg p(a),\ p(b),\ q(a),\ \neg q(b)\}$ is satisfiable, but there is no model where $p(X)$ is uniformly true (i.e. true for all instances of $X$) or $q(X)$ is uniformly true.

(3) propagation without satisfaction:

In the propositional case, when only one literal of a clause is not false in the model, this literal is propagated and added to the model, and the clause necessarily becomes true in the model and does not need to be considered in propagation anymore, at least until backtracking. In the first-order case, on the other hand, a clause such as $p(X) \vee q(X)$ would propagate the literal $q(a)$ in a model containing $\neg p(a)$, but $p(X) \vee q(X)$ does not become true in a model where $q(a)$ is true. It must remain available for further propagations. If, for instance, the literal $\neg p(b)$ is added to the model, the clause will be used again to propagate $q(b)$.

(4) quasi-falsification without propagation:

A clause is quasi-falsified by a model iff all but one of its literals are false in the model. In first-order logic, in contrast to propositional logic, it is not even the case that a clause will necessarily propagate a literal when only one of its literals is not false in the model. For instance, the clause $p(X) \vee q(X) \vee r(X)$ is quasi-falsified in a model containing $\neg p(a)$ and $\neg q(b)$, but no instance of $r(X)$ can be propagated.

The first two challenges affect search at a conceptual level, and solutions are discussed in Section 5. The last two prevent a direct first-order generalization of the data structures (e.g. watched literals) that make unit propagation so efficient in the propositional case. Partial solutions are discussed in Section 6.
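The first challenge can be made concrete with a small Python sketch of first-order unit propagation. It is only an illustration under stated assumptions: the tuple encoding of terms, the function names, and the restriction to a two-literal clause of the form "not premise, or conclusion" are choices made here, and the clause used in the usage note (a unit p(a) together with a clause that derives p(f(X)) from p(X)) is an illustrative instance of a non-terminating propagation. The step bound `max_steps` is what keeps the demonstration finite.

```python
from typing import Dict, List, Optional, Tuple, Union

# Assumed encoding: a str is a variable (e.g. "X"); a tuple is an application,
# e.g. ("a",) is the constant a and ("p", ("f", "X")) is the atom p(f(X)).
Term = Union[str, Tuple]
Subst = Dict[str, Term]


def walk(t: Term, s: Subst) -> Term:
    """Follow variable bindings in s until an unbound term is reached."""
    while isinstance(t, str) and t in s:
        t = s[t]
    return t


def occurs(v: str, t: Term, s: Subst) -> bool:
    """Occurs check: does variable v occur in t under substitution s?"""
    t = walk(t, s)
    if isinstance(t, str):
        return t == v
    return any(occurs(v, arg, s) for arg in t[1:])


def unify(t1: Term, t2: Term, s: Subst) -> Optional[Subst]:
    """Return an extension of s that unifies t1 and t2, or None."""
    t1, t2 = walk(t1, s), walk(t2, s)
    if t1 == t2:
        return s
    if isinstance(t1, str):
        return None if occurs(t1, t2, s) else {**s, t1: t2}
    if isinstance(t2, str):
        return None if occurs(t2, t1, s) else {**s, t2: t1}
    if t1[0] != t2[0] or len(t1) != len(t2):
        return None
    for a, b in zip(t1[1:], t2[1:]):
        s = unify(a, b, s)
        if s is None:
            return None
    return s


def substitute(t: Term, s: Subst) -> Term:
    """Apply substitution s to term t."""
    t = walk(t, s)
    if isinstance(t, str):
        return t
    return (t[0],) + tuple(substitute(arg, s) for arg in t[1:])


def propagate(unit: Term, premise: Term, conclusion: Term) -> Optional[Term]:
    """Resolve a unit with the clause (not premise) or conclusion."""
    s = unify(premise, unit, {})
    return None if s is None else substitute(conclusion, s)


def bounded_propagation(start: Term, premise: Term, conclusion: Term,
                        max_steps: int) -> List[Term]:
    """Propagate repeatedly, giving up after max_steps new units."""
    units = [start]
    for _ in range(max_steps):
        new = propagate(units[-1], premise, conclusion)
        if new is None or new in units:
            break
        units.append(new)
    return units
```

Calling `bounded_propagation(("p", ("a",)), ("p", "X"), ("p", ("f", "X")), 3)` yields four units, the last of which is p(f(f(f(a)))); raising `max_steps` always yields one more unit, which is the non-termination phenomenon in miniature.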

5 First-Order Model Construction and Proof Search

Despite the fundamental differences between propositional and first-order logic described in the previous section, the first-order algorithms presented aim to adhere as much as possible to the propositional procedure sketched in Section 2. As in the propositional case, the model under construction is a (conjunctive) list of literals, but literals may now contain (universal) variables. If a literal $\ell$ is in a model $M$, then any instance $\ell\sigma$ is said to be true in $M$. Note that checking that a literal $\ell$ is true in a model $M$ is more expensive in first-order logic than in propositional logic: whereas in the latter it suffices to check that $\ell$ is in $M$, in the former it is necessary to find a literal $\ell'$ in $M$ and a substitution $\sigma$ such that $\ell = \ell'\sigma$. A literal $\ell$ is said to be strongly true in a model $M$ iff $\ell$ is in $M$.

There is a straightforward solution for the second challenge (i.e. the absence of uniformly true literals in satisfied clauses): a clause is satisfied by a model $M$ iff all its relevant instances have a literal that is true in $M$, where an instance is said to be relevant if it substitutes the clause’s variables by terms that occur in $M$. Thus, for instance, the clause $p(X) \vee q(X)$ is satisfied by the model $[\neg p(a), p(b), q(a), \neg q(b)]$, because both relevant instances $p(a) \vee q(a)$ and $p(b) \vee q(b)$ have literals that are true in the model. However, this solution is costly, because it requires the generation of many instances. Fortunately, in many (though not all) cases, a satisfied clause will have a literal that is true in $M$, in which case the clause is said to be uniformly satisfied. Uniform satisfaction is cheaper to check than satisfaction. However, a drawback of uniform satisfaction is that the model construction algorithm may repeatedly attempt to satisfy a clause that is not uniformly satisfied, by choosing one of its literals as a decision literal. For example, the clause $p(X) \vee q(X)$ is not uniformly satisfied by the model $[\neg p(a), p(b), q(a), \neg q(b)]$. Without knowing that this clause is already satisfied by the model, the procedure would try to choose either $p(X)$ or $q(X)$ as decision literal. But both choices are useless decisions, because they would lead to conflicts with conflict-driven clauses equal to a previously derived clause or to a unit clause containing a literal that is part of the current model. A clause is said to be weakly satisfied by a model if and only if all its literals are useless decisions.
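The difference between satisfaction and uniform satisfaction can be sketched in Python as follows. This is a simplified illustration, not Scavenger's code: literals are assumed to be unary (one argument each), a clause is assumed to share a single variable, arguments starting with an upper-case letter are treated as variables, and the names `true_in`, `uniformly_satisfied` and `satisfied` are hypothetical.

```python
from typing import List, Tuple

# Assumed encoding: (positive?, predicate, argument). An argument starting
# with an upper-case letter is a variable; anything else is a constant.
Literal = Tuple[bool, str, str]


def is_var(arg: str) -> bool:
    return arg[:1].isupper()


def true_in(lit: Literal, model: List[Literal]) -> bool:
    """A literal is true in the model if it is an instance of a model literal."""
    sign, pred, arg = lit
    return any(
        m_sign == sign and m_pred == pred and (is_var(m_arg) or m_arg == arg)
        for m_sign, m_pred, m_arg in model
    )


def uniformly_satisfied(clause: List[Literal], model: List[Literal]) -> bool:
    """Cheap test: some literal of the clause is true in the model as it stands."""
    return any(true_in(lit, model) for lit in clause)


def satisfied(clause: List[Literal], model: List[Literal]) -> bool:
    """Costly test: every relevant instance has a literal true in the model."""
    constants = sorted({arg for _, _, arg in model if not is_var(arg)})
    if not constants:
        return uniformly_satisfied(clause, model)

    def instantiate(lit: Literal, constant: str) -> Literal:
        sign, pred, arg = lit
        return (sign, pred, constant if is_var(arg) else arg)

    return all(
        any(true_in(instantiate(lit, c), model) for lit in clause)
        for c in constants
    )
```

With the illustrative clause `[(True, "p", "X"), (True, "q", "X")]` and the model `[(False, "p", "a"), (True, "p", "b"), (True, "q", "a"), (False, "q", "b")]`, `satisfied` holds while `uniformly_satisfied` does not, which is exactly the situation in which a search procedure relying only on the cheap test keeps trying useless decisions.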

Because of the first challenge (i.e. the non-termination of unit-propagation in the general first-order case), it is crucial to make decisions during unit propagation. In the example given in item 1 of Section 4, for instance, deciding $q$ at any moment would allow the propagation of $r$ and $\neg r$ (respectively due to the 4th and 6th clauses), triggering a conflict. The learned clause would be $\neg q$ and it would again trigger a conflict by the propagation of $r$ and $\neg r$ (this time due to the 3rd and 5th clauses). As this last conflict does not depend on any decision literal, the empty clause is derived and thus the clause set is refuted.

The question is how to interleave decisions and propagations. One straightforward approach is to keep track of the propagation depth in the implication graph: any decision literal or literal propagated by a unit clause has propagation depth $0$; any other literal has propagation depth $k+1$, where $k$ is the maximum propagation depth of its predecessors. (Because of the isomorphism between implication graphs and subderivations in Conflict Resolution [ConflictResolution], the propagation depth is equal to the corresponding subderivation’s height, where initial axiom clauses and learned clauses have height $0$ and the height of the conclusion of a unit-propagating resolution inference is $k+1$, where $k$ is the maximum height of its unit premises.) Then propagation is performed exhaustively only up to a propagation depth threshold. A decision literal is then chosen and the threshold is incremented. Such eager decisions guarantee that a decision will eventually be made, even if there is an infinite propagation path.

However, eager decisions may also lead to spurious conflicts generating useless conflict-driven clauses. For instance, consider an unsatisfiable clause set with numbered clauses in which a conflict with no decisions can be obtained by propagating a unit (by clause 1), then a long chain of further units (by clause 2, repeatedly), the last of which conflicts with clause 3. If the propagation depth threshold is lower than the depth of that propagation, a decision literal is chosen before the conflict is reached. If a literal is chosen, for example, in an attempt to satisfy the sixth clause, there may be propagations (using the decision literal and clauses 1, 4, 5 and 6) with depth lower than the threshold and reaching a conflict that generates a clause which is useless for showing unsatisfiability of the whole clause set. This is not a serious issue, because useless clauses are often generated in conflicts with non-eager decisions as well. Nevertheless, this example suggests that the starting threshold and the strategy for increasing the threshold have to be chosen wisely, since the performance may be sensitive to this choice.
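The notion of propagation depth can be illustrated with a short Python sketch over an implication graph. The dictionary representation (each literal mapped to the literals it was propagated from) and the function names are assumptions made here; the sketch further assumes that decision literals and literals propagated by unit clauses have depth 0 and that every other literal is one deeper than its deepest predecessor.

```python
from typing import Dict, List


def propagation_depths(predecessors: Dict[str, List[str]]) -> Dict[str, int]:
    """Compute the propagation depth of every literal in an implication graph.

    predecessors maps a literal to the literals it was propagated from.
    Literals with no predecessors (decisions, or literals propagated by
    unit clauses) get depth 0; the graph is assumed to be acyclic.
    """
    depths: Dict[str, int] = {}

    def depth(node: str) -> int:
        if node not in depths:
            preds = predecessors.get(node, [])
            depths[node] = 0 if not preds else 1 + max(depth(p) for p in preds)
        return depths[node]

    for node in predecessors:
        depth(node)
    return depths


def within_threshold(predecessors: Dict[str, List[str]],
                     threshold: int) -> List[str]:
    """Literals whose propagation depth does not exceed the threshold."""
    depths = propagation_depths(predecessors)
    return sorted(lit for lit, d in depths.items() if d <= threshold)
```

For a hypothetical graph in which p(f(a)) is propagated from p(a) and p(f(f(a))) from p(f(a)), a threshold of 1 admits p(a) and p(f(a)) only, so a decision would be made before the chain is pursued further.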

Interestingly, the problem of non-terminating propagation does not manifest in fragments of first-order logic where infinite unit propagation paths are impossible. A well-known and large fragment is the effectively propositional (a.k.a. Bernays-Schönfinkel) class, consisting of sentences with prenex forms that have an $\exists^*\forall^*$ quantifier prefix and no function symbols. For this fragment, a simpler proof search strategy that only makes decisions when unit propagation terminates, as in the propositional case, suffices. Infinite unit propagation paths do not occur in the effectively propositional fragment because there are no function symbols and hence the term depth does not increase arbitrarily. (The depth of constants and variables is zero and the depth of a complex term is $k+1$, where $k$ is the maximum depth of its proper subterms.) Whenever the term depth is bounded, infinite unit propagation paths cannot occur, because there are only finitely many literals with bounded term depth (given the finite set of constant, function and predicate symbols with finite arity occurring in the clause set).

The insight that term depth is important naturally suggests a different approach for the general first-order case: instead of limiting the propagation depth, limit the term depth, allowing arbitrarily long propagations as long as the term depth of the propagated literals is smaller than the current term depth threshold. A literal is propagated only if its term depth is smaller than the threshold. New decisions are chosen when the term-depth-bounded propagation terminates and there are still clauses that are not uniformly satisfied. As before, eager decisions may lead to spurious conflicts, but bounding propagation by term depth seems intuitively more sensible than bounding it by propagation depth.

6 Implementation Details

Scavenger is implemented in Scala, and its source code and usage instructions are available online. Its packrat combinator parsers are able to parse TPTP CNF files [TPTP]. Although Scavenger is a first-order prover, every logical expression is converted to a simply typed lambda expression, implemented by the abstract class E with concrete subclasses Sym, App and Abs for, respectively, symbols, applications and abstractions. A trait Var is used to distinguish variables from other symbols. Scala’s case classes are used to make E behave like an algebraic datatype with (pattern-matchable) constructors. The choice of simply typed lambda expressions is motivated by the intention to generalize Scavenger to multi-sorted first-order logic and higher-order logic and to support TPTP TFF and THF in the future. Every clause is internally represented as an immutable two-sided sequent consisting of a set of positive literals (succedent) and a set of negative literals (antecedent).

When a problem is unsatisfiable, Scavenger can output a refutation internally represented as a collection of ProofNode objects, which can be instances of the following immutable classes: UnitPropagatingResolution, Conflict, ConflictDrivenClauseLearning, Axiom, Decision. The first three classes correspond directly to the rules shown in Figure 1. Axiom is used for leaf nodes containing input clauses, and Decision represents a fictive rule holding decision literals. Each class is responsible for checking, typically through require statements, the soundness conditions of its corresponding inference rule. The Axiom, Decision and ConflictDrivenClauseLearning classes are less than 5 lines of code each. Conflict and UnitPropagatingResolution are respectively 15 and 35 lines of code. The code for analyzing conflicts, traversing the subderivations (conflict graphs) and finding decisions that contributed to the conflict, is implemented in a superclass, and is 17 lines long.

The following three variants of Scavenger were implemented:

  • EP-Scavenger: aiming at the effectively propositional fragment, propagation is not bounded, and decisions are made only when propagation terminates.

  • PD-Scavenger: propagation is bounded by a propagation depth threshold that starts at a fixed initial value. Input clauses are assigned an initial depth, and derived clauses and propagated literals are assigned a depth determined by the depth threshold in force when they are obtained. The threshold is incremented whenever every input clause that is neither uniformly satisfied nor weakly satisfied is used to derive a new clause or to propagate a new literal. If this is not the case, a decision literal is chosen (and assigned a depth in the same way) to uniformly satisfy one of the clauses that is neither uniformly satisfied nor weakly satisfied.

  • TD-Scavenger: propagation is bounded by a term depth threshold that starts at a fixed initial value. When propagation terminates, a stochastic choice between either selecting a decision literal or incrementing the threshold is made with probability of 50% for each option. Only uniform satisfaction of clauses is checked.

The third and fourth challenges discussed in Section 4 are critical for performance, because they prevent a direct first-order generalization of data structures such as watched literals, which enable efficient detection of clauses that are ready to propagate literals. Without knowing exactly which clauses are ready to propagate, Scavenger (in its three variants) loops through all clauses with the goal of using them for propagation. However, actually trying to use a given clause for propagation is costly. In order to avoid this cost, Scavenger performs two quicker tests. Firstly, it checks whether the clause is uniformly satisfied (by checking whether one of its literals belongs to the model). If it is, then the clause is dismissed. This is an imperfect test, however. Occasionally, some satisfied clauses will not be dismissed, because (in first-order logic) not all satisfied clauses are uniformly satisfied. Secondly, for every literal $\ell$ of every clause, Scavenger keeps a set of decision literals and propagated literals that are unifiable with $\ell$. A clause $C$ is considered quasi-falsified when at most one literal of $C$ has an empty set associated with it. This is a rough analogue of watched literals for detecting quasi-falsified clauses. Again, this is an imperfect test, because (in first-order logic) not all quasi-falsified clauses are ready to propagate. Despite the imperfections of these tests, they do reduce the number of clauses that need to be considered for propagation, and they are quick and simple to implement.
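The two quick tests can be sketched in Python as a filter that decides whether a clause is worth trying for propagation. This is a hedged illustration rather than Scavenger's implementation: literals are plain strings, the sets of unifiable model literals are assumed to be maintained elsewhere and passed in as a dictionary, and the names `quasi_falsified` and `worth_trying` are hypothetical.

```python
from typing import Dict, List, Set


def quasi_falsified(clause: List[str], unifiable: Dict[str, Set[str]]) -> bool:
    """Second test: at most one literal of the clause has an empty set of
    unifiable model literals associated with it."""
    empty = sum(1 for lit in clause if not unifiable.get(lit, set()))
    return empty <= 1


def worth_trying(clause: List[str], model: Set[str],
                 unifiable: Dict[str, Set[str]]) -> bool:
    """Apply both quick tests before the costly propagation attempt."""
    # First test: dismiss the clause if one of its literals is in the model.
    if any(lit in model for lit in clause):
        return False
    return quasi_falsified(clause, unifiable)
```

Both tests are deliberately approximate. For the illustrative clause `["p(X)", "q(X)", "r(X)"]` with model literals `~p(a)` and `~q(b)` recorded against `p(X)` and `q(X)` respectively, `worth_trying` returns True even though no instance of `r(X)` can actually be propagated, because the two recorded literals bind `X` to different constants.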

Overall, the three variants of Scavenger listed above have been implemented concisely. Their main classes are only 168, 342 and 176 lines long, respectively, and no attempt has been made to increase efficiency at the expense of the code’s readability and pedagogical value. Premature optimization would be inappropriate for a first proof-of-concept.

Scavenger still has no sophisticated backtracking and restarting mechanism, as propositional SAT solvers do. When Scavenger reaches a conflict, it restarts almost completely: all derived conflict-driven clauses are kept, but the model under construction is reset to the empty model.

7 Experiments

Experiments were conducted in the StarExec cluster [StarExec] to evaluate Scavenger’s performance on TPTP v6.4.0 benchmarks in CNF form and without equality. (Raw experimental data are available online.) For comparison, all other 21 provers available in StarExec’s TPTP community and suitable for CNF problems without equality were evaluated as well. For each job pair, the timeouts were 300 CPU seconds and 600 wall-clock seconds.

                       Problems Solved
Prover                 EPR      All
PEPR-0.0ps             432      432
GrAnDe-1.1             447      447
Paradox-3.0            467      506
ZenonModulo-0.4.1      315      628
TD-Scavenger           350      695
PD-Scavenger           252      782
Geo-III-2016C          344      840
EP-Scavenger           349      891
Metis-2.3              404      950
Z3-4.4.1               507     1027
Zipperpin-FOF-0.4      400     1029
Otter-3.3              362     1068
Bliksem-1.12           424     1107
SOS-2.0                351     1129
CVC4-FOF-1.5.1         452     1145
SNARK-20120808         417     1150
Beagle-0.9.47          402     1153
E-Darwin-1.5           453     1213
Prover9-1109a          403     1293
Darwin-1.4.5           508     1357
iProver-2.5            551     1437
ET-0.2                 486     1455
E-2.0                  489     1464
Vampire-4.1            540     1524

Table 1: Number of problems solved by each prover

Table 1 shows how many of the 1606 unsatisfiable CNF problems and 572 effectively propositional (EPR) unsatisfiable CNF problems each theorem prover solved; Figures 2 and 3 show the performance in more detail. For a first implementation, the best variants of Scavenger show an acceptable performance. All variants of Scavenger outperformed PEPR, GrAnDe, DarwinFM, Paradox and ZenonModulo; and EP-Scavenger additionally outperformed Geo-III. On the effectively propositional problems, TD-Scavenger outperformed LEO-II, ZenonModulo and Geo-III, and solved only 1 problem less than SOS-2.0 and 12 less than Otter-3.3. Although Otter-3.3 has long ceased to be a state-of-the-art prover and has been replaced by Prover9, the fact that Scavenger solves almost as many problems as Otter-3.3 is encouraging, because Otter-3.3 is a mature prover with 15 years of development, implementing (in the C language) several refinements of proof search for resolution and paramodulation (e.g. orderings, set of support, splitting, demodulation, subsumption) [Otter, OtterManual], whereas Scavenger is a yet unrefined and concise implementation (in Scala) of a comparatively straightforward search strategy for proofs in the Conflict Resolution calculus, completed in slightly more than 3 months. Conceptually, Geo-III (based on Geometric Resolution) and Darwin (based on Model Evolution) are the most similar to Scavenger. While Scavenger already outperforms Geo-III, it is still far from Darwin. This is most probably due to Scavenger’s current eagerness to restart after every conflict, whereas Darwin backtracks more carefully (cf. Sections 6 and 8). Scavenger and Darwin also treat variables in decision literals differently. Consequently, Scavenger detects more (and non-ground) conflicts, but learning conflict-driven clauses can be more expensive, because unifiers must be collected from the conflict graph and composed.

Figure 2: Performance on all benchmarks (provers ordered by performance)
Figure 3: Performance on EPR benchmarks only (provers ordered by performance)

EP-Scavenger solved 28.2% more problems than TD-Scavenger and 13.9% more than PD-Scavenger. This suggests that non-termination of unit-propagation is an uncommon issue in practice: EP-Scavenger is still able to solve many problems, even though it does not bound propagation, whereas the other two variants solve fewer problems because of the overhead of bounding propagation even when it is not necessary. Nevertheless, there were 28 problems solved only by PD-Scavenger and 26 problems solved only by TD-Scavenger (among Scavenger’s variants). EP-Scavenger and PD-Scavenger can solve 9 problems with TPTP difficulty rating 0.5, all from the SYN and FLD domains. 3 of the 9 problems were solved in less than 10 seconds.
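The two percentages follow directly from the "All" column of Table 1, as the following small Python check shows. The dictionary and the helper name `percent_more` are introduced here only for the calculation.

```python
# Problems solved on all benchmarks, taken from the "All" column of Table 1.
solved = {"EP-Scavenger": 891, "PD-Scavenger": 782, "TD-Scavenger": 695}


def percent_more(a: str, b: str) -> float:
    """How many percent more problems prover a solved than prover b."""
    return round(100.0 * (solved[a] - solved[b]) / solved[b], 1)


ep_vs_td = percent_more("EP-Scavenger", "TD-Scavenger")  # (891 - 695) / 695
ep_vs_pd = percent_more("EP-Scavenger", "PD-Scavenger")  # (891 - 782) / 782
```

The first ratio is 196/695 and the second is 109/782, which round to the 28.2% and 13.9% quoted in the text.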

8 Conclusions and Future Work

Scavenger is the first theorem prover based on the new Conflict Resolution calculus. The experiments show a promising, albeit not yet competitive, performance.

A comparison of the performance of the three variants of Scavenger shows that it is non-trivial to interleave decisions within possibly non-terminating unit-propagations, and further research is needed to determine (possibly in a problem-dependent way) optimal initial depth thresholds and threshold incrementation strategies. Alternatively, entirely different criteria could be explored for deciding to make an eager decision before propagation is over. For instance, decisions could be made if a fixed or dynamically adjusted amount of time elapses.

The performance bottleneck that needs to be most urgently addressed in future work is backtracking and restarting. Currently, all variants of Scavenger restart after every conflict, keeping derived conflict-driven clauses but throwing away the model constructed so far. They must reconstruct models from scratch after every conflict. This requires a lot of repeated re-computation, and therefore a significant performance boost could be expected through a more sensible backtracking strategy. Scavenger’s current naive unification algorithm could be improved with term indexing [Indexing], and there might also be room to improve Scavenger’s rough first-order analogue for the watched literals data structure, even though the first-order challenges make it unlikely that something as good as the propositional watched literals data structure could ever be developed. Further experimentation is also needed to find optimal values for the parameters used in Scavenger for governing the initial thresholds and their incrementation policies.

Scavenger’s already acceptable performance, despite the room for implementation improvements just discussed, indicates that automated theorem proving based on the Conflict Resolution calculus is feasible. However, much work remains to be done to determine whether this approach will eventually become competitive with today’s fastest provers.


We thank Ezequiel Postan for his implementation of TPTP parsers for Skeptik [Skeptik], which we have reused in Scavenger. We are grateful to Albert A. V. Giegerich, Aaron Stump and Geoff Sutcliffe for all their help in setting up our experiments in StarExec. This research was partially funded by the Australian Government through the Australian Research Council and by the Google Summer of Code 2016 program. Daniyar Itegulov was financially supported by the Russian Scientific Foundation (grant 15-14-00066).