1 Introduction
The outstanding efficiency of current propositional SAT solvers naturally raises the question of whether similar ideas could be employed to automate first-order logical reasoning. The recent Conflict Resolution calculus [ConflictResolution] (not to be confused with the homonymous calculus for linear rational inequalities [KorovinTVConflictResolution]) can be regarded as a crucial initial step towards answering this question. From a proof-theoretical perspective, Conflict Resolution generalizes (to first-order logic) the two main mechanisms on which modern SAT solvers are based: unit propagation and conflict-driven clause learning. The calculus is sound and refutationally complete, and its derivations are isomorphic to implication graphs.
This paper goes one step further by defining proof search algorithms for Conflict Resolution. Familiarity with the propositional CDCL procedure [CDCL] is assumed, even though it is briefly sketched in Section 2. The main challenges in lifting this procedure to first-order logic are that, unlike in propositional logic, first-order unit propagation does not always terminate and true clauses do not necessarily have uniformly true literals (cf. Section 4). Our solutions to these challenges are discussed in Section 5 and Section 6, and experimental results are presented in Section 7.
Related Work:
Conflict Resolution's unit-propagating resolution rule can be traced back to unit-resulting resolution [URR]. Other attempts to lift DPLL [DavisPutnam, DLL] or CDCL [CDCL] to first-order logic include Model Evolution [BaumgartnerFODPLL, BaumgartnerTinelli, Baumgartner, BaumgartnerTinelliLemmaLearning], Geometric Resolution [GeometricResolution], Non-Redundant Clause Learning [AlagiWeidenbach] and the Semantically-Guided Goal-Sensitive procedure [BonacinaPlaisted1, BonacinaPlaisted2, BonacinaPlaisted3, BonacinaPlaisted4]. A brief summary of these approaches and a comparison with Conflict Resolution can be found in [ConflictResolution]. Furthermore, many architectures [Equinox, iProver, InstGen, AVATAR, Satallax] for first-order and higher-order theorem proving use a SAT solver as a black box for propositional reasoning, without attempting to lift it; and Semantic Resolution [SCOtt1, SCOtt2] is yet another related approach that uses externally built first-order models to guide resolution.
2 Propositional CDCL
During search in the propositional case, a SAT solver keeps a model (a.k.a. trail) consisting of a (conjunctive) list of decision literals and propagated literals. Literals of unit clauses are automatically added to the trail, and whenever a clause has only one literal that is not falsified by the current model, this literal is added to the model (thereby satisfying that clause). This process is known as unit propagation. If unit propagation reaches a conflict (i.e. a situation where the dual of a literal already contained in the model would have to be added to it), the SAT solver backtracks, removing from the model the decision literals responsible for the conflict (as well as the propagated literals entailed by the removed decision literals) and deriving, or learning, a conflict-driven clause consisting of duals of the decision literals responsible for the conflict (or the empty clause, if there were no decision literals). In practice, optimizations (e.g. 1UIP) are used, and more sophisticated clauses, which are not just disjunctions of duals of the decision literals involved in the conflict, can be derived; these optimizations are inessential to the focus of this paper. If unit propagation terminates without reaching a conflict and all clauses are satisfied by the model, then the input clause set is satisfiable. If some clauses are still not satisfied, the SAT solver chooses and assigns another decision literal, adding it to the trail and satisfying the clauses that contain it.
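The propagation and conflict steps described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: literals are DIMACS-style integers, the function name `unit_propagate` and the four-clause example are hypothetical, and the learned clause is simply the dual of the single decision literal (no 1UIP or other optimization).

```python
from typing import List, Optional, Set, Tuple

Clause = List[int]  # DIMACS-style literals: 3 means x3, -3 means "not x3"


def unit_propagate(clauses: List[Clause], trail: List[int]) -> Tuple[List[int], Optional[Clause]]:
    """Extend the trail by unit propagation.

    Returns the extended trail and a falsified clause if a conflict
    is reached (otherwise None).
    """
    trail = list(trail)
    assigned: Set[int] = set(trail)
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            if any(lit in assigned for lit in clause):
                continue  # clause already satisfied by the trail
            open_lits = [lit for lit in clause if -lit not in assigned]
            if not open_lits:
                return trail, clause  # every literal is falsified: conflict
            if len(open_lits) == 1:
                trail.append(open_lits[0])  # the only non-falsified literal is propagated
                assigned.add(open_lits[0])
                changed = True
    return trail, None


# Hypothetical example: all four clauses over x1 and x2 (unsatisfiable).
clauses = [[1, 2], [-1, 2], [1, -2], [-1, -2]]

# Decide x1: x2 is propagated and the clause (not x1 or not x2) is falsified.
trail, conflict = unit_propagate(clauses, [1])

# Learn the dual of the only decision literal and start again from an empty trail.
learned = [-1]
trail2, conflict2 = unit_propagate(clauses + [learned], [])
# conflict2 is reached without any decision literal, so the clause set is unsatisfiable.
```

The second call reaches a conflict with no decision literal on the trail, which corresponds to deriving the empty clause.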
3 Conflict Resolution
The inference rules of the Conflict Resolution calculus are shown in Figure 1. The unit-propagating resolution rule is a chain of restricted resolutions with unit clauses as left premises and a unit clause as final conclusion. Decision literals are denoted by square brackets, and the conflict-driven clause learning rule infers a new clause consisting of negations of instances of the decision literals used to reach a conflict (a.k.a. the empty clause). A clause learning inference is said to discharge the decision literals that it uses. As in the resolution calculus, derivations are directed acyclic graphs that are not necessarily tree-like. A refutation is a derivation of the empty clause with no undischarged decision literals.
From a natural deduction point of view, a unit-propagating resolution rule can be regarded as a chain of implication eliminations taking unification into account, whereas decision literals and conflict-driven clause learning are reminiscent of, respectively, assumptions and chains of negation introductions, also generalized to first-order logic through unification. Therefore, Conflict Resolution can be considered a first-order hybrid of resolution and natural deduction.
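[Figure 1: The inference rules of the Conflict Resolution calculus: Unit-Propagating Resolution, Conflict, and Conflict-Driven Clause Learning. In the Unit-Propagating Resolution and Conflict rules, the substitution is a unifier of the paired literals. In the Conflict-Driven Clause Learning rule, each substitution applied to a decision literal is the composition of all substitutions used on one path of the proof; since a proof DAG is not necessarily tree-like, there may be more than one path connecting the same two nodes.]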
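The clause learning rule composes the substitutions used along each path of the proof and applies the result to the decision literal. The following Python sketch illustrates only that composition step. The term encoding (uppercase strings for variables, lowercase strings for constants, tuples for compound terms), the function names, and the two example paths are assumptions made for illustration; they are not taken from the calculus's formal definition or from any implementation.

```python
from functools import reduce


def is_var(t):
    """Variables are strings starting with an uppercase letter (an assumed convention)."""
    return isinstance(t, str) and t[:1].isupper()


def apply(subst, t):
    """Apply a substitution (dict from variable to term) to a term."""
    if is_var(t):
        return subst.get(t, t)
    if isinstance(t, tuple):
        return (t[0],) + tuple(apply(subst, a) for a in t[1:])
    return t  # constant


def compose(s1, s2):
    """Return the substitution equivalent to applying s1 first and then s2."""
    result = {v: apply(s2, t) for v, t in s1.items()}
    for v, t in s2.items():
        result.setdefault(v, t)  # bindings of s2 for variables untouched by s1
    return result


def learned_clause(decision_atom, paths):
    """One negated instance of a (positive) decision literal per path.

    Each path is the list of substitutions used on it, in order.
    """
    return [("not", apply(reduce(compose, path, {}), decision_atom)) for path in paths]


# Hypothetical decision literal p(X), reached from the conflict by two paths.
path1 = [{"X": ("f", "Y")}, {"Y": "a"}]  # composes to X -> f(a), Y -> a
path2 = [{"X": "b"}]
clause = learned_clause(("p", "X"), [path1, path2])
# clause represents: not p(f(a)) or not p(b)
```

Because one decision literal may be connected to the conflict by several paths, it can contribute several differently instantiated literals to the learned clause, as in this example.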
4 Lifting Challenges
First-order logic presents many new challenges for methods based on propagation and decisions, of which the following can be singled out:
(1) non-termination of unit propagation:
In first-order logic, unit propagation may never terminate. For example, the clause set is clearly unsatisfiable, because there is no assignment of and to true or false that would satisfy all the last four clauses. However, unit propagation would derive the following infinite sequence of units, by successively resolving with previously derived units, starting with : . Consequently, a proof search strategy that waits for unit propagation to terminate before making decisions would never be able to conclude that the given clause set is unsatisfiable.
(2) absence of uniformly true literals in satisfied clauses:
While in the propositional case a clause that is true in a model always has at least one literal that is true in that model, this is not so in first-order logic, because shared variables create dependencies between literals. For instance, the clause set is satisfiable, but there is no model where is uniformly true (i.e. true for all instances of ) or is uniformly true.
(3) propagation without satisfaction:
In the propositional case, when only one literal of a clause is not false in the model, this literal is propagated and added to the model, and the clause necessarily becomes true in the model and does not need to be considered in propagation anymore, at least until backtracking. In the first-order case, on the other hand, a clause such as would propagate the literal in a model containing , but does not become true in a model where is true. It must remain available for further propagations. If, for instance, the literal is added to the model, the clause will be used again to propagate .
(4) quasi-falsification without propagation:
A clause is quasi-falsified by a model iff all but one of its literals are false in the model. In first-order logic, in contrast to propositional logic, it is not even the case that a clause will necessarily propagate a literal when only one of its literals is not false in the model. For instance, the clause is quasi-falsified in a model containing and , but no instance of can be propagated.
The first two challenges affect search at a conceptual level, and solutions are discussed in Section 5. The last two prevent a direct first-order generalization of the data structures (e.g. watched literals) that make unit propagation so efficient in the propositional case. Partial solutions are discussed in Section 6.
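Challenges (3) and (4) can be made concrete with a small Python sketch of unit-propagating resolution over first-order literals. Everything here is an illustrative assumption: the term encoding (uppercase strings are variables), the function names, and the example clauses `p(X) ∨ q(X)` and `p(X) ∨ q(X) ∨ r(X)` with models over the constants `a` and `b`. The model's literals are assumed to share no variables with the clause.

```python
def is_var(t):
    return isinstance(t, str) and t[:1].isupper()


def walk(t, s):
    """Follow variable bindings in the (triangular) substitution s."""
    while is_var(t) and t in s:
        t = s[t]
    return t


def occurs(v, t, s):
    t = walk(t, s)
    if t == v:
        return True
    return isinstance(t, tuple) and any(occurs(v, a, s) for a in t[1:])


def unify(t1, t2, s):
    """Extend s to a unifier of t1 and t2, or return None if there is none."""
    t1, t2 = walk(t1, s), walk(t2, s)
    if t1 == t2:
        return s
    if is_var(t1):
        return None if occurs(t1, t2, s) else {**s, t1: t2}
    if is_var(t2):
        return unify(t2, t1, s)
    if (isinstance(t1, tuple) and isinstance(t2, tuple)
            and t1[0] == t2[0] and len(t1) == len(t2)):
        for a, b in zip(t1[1:], t2[1:]):
            s = unify(a, b, s)
            if s is None:
                return None
        return s
    return None


def resolve(t, s):
    """Fully apply the substitution s to the term t."""
    t = walk(t, s)
    if isinstance(t, tuple):
        return (t[0],) + tuple(resolve(a, s) for a in t[1:])
    return t


def falsifiers(literals, model, s):
    """Yield every substitution under which all `literals` clash with model literals."""
    if not literals:
        yield s
        return
    (sign, atom), rest = literals[0], literals[1:]
    for m_sign, m_atom in model:
        if m_sign != sign:  # only a literal of opposite sign can falsify
            s2 = unify(atom, m_atom, s)
            if s2 is not None:
                yield from falsifiers(rest, model, s2)


def propagations(clause, model):
    """Literals obtained by resolving away all but one literal of `clause`."""
    results = []
    for i, (sign, atom) in enumerate(clause):
        others = clause[:i] + clause[i + 1:]
        for s in falsifiers(others, model, {}):
            results.append((sign, resolve(atom, s)))
    return results


# Literals are (sign, atom) pairs; True means positive.
p_or_q = [(True, ("p", "X")), (True, ("q", "X"))]
first = propagations(p_or_q, [(False, ("p", "a"))])
second = propagations(p_or_q, [(False, ("p", "a")), (False, ("p", "b"))])

p_q_r = [(True, ("p", "X")), (True, ("q", "X")), (True, ("r", "X"))]
nothing = propagations(p_q_r, [(False, ("p", "a")), (False, ("q", "b"))])
```

In this sketch, `first` contains only `q(a)`, while `second` also contains `q(b)`: the same clause propagates again once the model grows, so it cannot be retired after its first propagation. In the last call, `p(X)` and `q(X)` are each falsified by some model literal, but no single substitution falsifies both, so `nothing` is empty.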
5 First-Order Model Construction and Proof Search
Despite the fundamental differences between propositional and first-order logic described in the previous section, the first-order algorithms presented here aim to adhere as much as possible to the propositional procedure sketched in Section 2. As in the propositional case, the model under construction is a (conjunctive) list of literals, but literals may now contain (universal) variables. If a literal is in a model, then any instance of that literal is said to be true in the model. Note that checking that a literal is true in a model is more expensive in first-order logic than in propositional logic: whereas in the latter it suffices to check that the literal is in the model, in the former it is necessary to find a literal in the model and a substitution that maps it to the literal being checked. A literal is said to be strongly true in a model iff it is itself in the model.
There is a straightforward solution for the second challenge (i.e. the absence of uniformly true literals in satisfied clauses): a clause is satisfied by a model iff all its relevant instances have a literal that is true in the model, where an instance is said to be relevant if it substitutes the clause's variables by terms that occur in the model. Thus, for instance, the clause is satisfied by the model , because both relevant instances and have literals that are true in the model. However, this solution is costly, because it requires the generation of many instances. Fortunately, in many (though not all) cases, a satisfied clause will have a literal that is true in the model, in which case the clause is said to be uniformly satisfied. Uniform satisfaction is cheaper to check than satisfaction. However, a drawback of uniform satisfaction is that the model construction algorithm may repeatedly attempt to satisfy a clause that is not uniformly satisfied, by choosing one of its literals as a decision literal. For example, the clause is not uniformly satisfied by the model . Without knowing that this clause is already satisfied by the model, the procedure would try to choose either or as decision literal. But both choices are useless decisions, because they would lead to conflicts whose conflict-driven clauses are equal to a previously derived clause or to a unit clause containing a literal that is part of the current model. A clause is said to be weakly satisfied by a model if and only if all its literals are useless decisions.
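The difference between satisfaction (checked on relevant instances) and uniform satisfaction can be illustrated with a short Python sketch. The encoding, the function names, and the example (the clause `p(X) ∨ q(X)` with the model `¬p(a), p(b), q(a), ¬q(b)`) are assumptions chosen for illustration. Model literals are assumed to use variable names distinct from those of the clause, and the relevant terms are taken to be the ground terms occurring as arguments in the model.

```python
from itertools import product


def is_var(t):
    return isinstance(t, str) and t[:1].isupper()


def variables(t):
    if is_var(t):
        return {t}
    if isinstance(t, tuple):
        return set().union(*(variables(a) for a in t[1:]))
    return set()


def substitute(t, s):
    if is_var(t):
        return s.get(t, t)
    if isinstance(t, tuple):
        return (t[0],) + tuple(substitute(a, s) for a in t[1:])
    return t


def match(pattern, target, s):
    """One-way matching: extend s so that `pattern` instantiated by s equals `target`."""
    if is_var(pattern):
        if pattern in s:
            return s if s[pattern] == target else None
        return {**s, pattern: target}
    if (isinstance(pattern, tuple) and isinstance(target, tuple)
            and pattern[0] == target[0] and len(pattern) == len(target)):
        for p, t in zip(pattern[1:], target[1:]):
            s = match(p, t, s)
            if s is None:
                return None
        return s
    return s if pattern == target else None


def is_true(literal, model):
    """A literal is true in the model if it is an instance of some model literal."""
    sign, atom = literal
    return any(m_sign == sign and match(m_atom, atom, {}) is not None
               for m_sign, m_atom in model)


def uniformly_satisfied(clause, model):
    return any(is_true(lit, model) for lit in clause)


def subterms(t):
    yield t
    if isinstance(t, tuple):
        for a in t[1:]:
            yield from subterms(a)


def model_terms(model):
    """Ground terms occurring as (sub)arguments of the model's atoms."""
    return sorted({t for _, atom in model for arg in atom[1:]
                   for t in subterms(arg) if not variables(t)}, key=repr)


def satisfied(clause, model):
    """Every relevant instance of the clause has a literal that is true in the model."""
    vs = sorted(set().union(*(variables(atom) for _, atom in clause)))
    for values in product(model_terms(model), repeat=len(vs)):
        s = dict(zip(vs, values))
        if not any(is_true((sign, substitute(atom, s)), model) for sign, atom in clause):
            return False
    return True


clause = [(True, ("p", "X")), (True, ("q", "X"))]
model = [(False, ("p", "a")), (True, ("p", "b")), (True, ("q", "a")), (False, ("q", "b"))]
```

For this example, `satisfied(clause, model)` holds because the instance for `a` is made true by `q(a)` and the instance for `b` by `p(b)`, whereas `uniformly_satisfied(clause, model)` fails because neither `p(X)` nor `q(X)` is an instance of a model literal. The cost difference is also visible: `satisfied` enumerates one instance per combination of relevant terms, while `uniformly_satisfied` inspects each literal once.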
Because of the first challenge (i.e. the non-termination of unit propagation in the general first-order case), it is crucial to make decisions during unit propagation. In the example given in item 1 of Section 4, for instance, deciding at any moment would allow the propagation of and (respectively due to the 4th and 6th clauses), triggering a conflict. The learned clause would be and it would again trigger a conflict by the propagation of and (this time due to the 3rd and 5th clauses). As this last conflict does not depend on any decision literal, the empty clause is derived and thus the clause set is refuted. The question is how to interleave decisions and propagations. One straightforward approach is to keep track of the propagation depth in the implication graph: any decision literal or literal propagated by a unit clause has the base propagation depth; any other literal has a propagation depth one greater than the maximum propagation depth of its predecessors. (Because of the isomorphism between implication graphs and subderivations in Conflict Resolution [ConflictResolution], the propagation depth is equal to the corresponding subderivation's height, where initial axiom clauses and learned clauses have the base height and the height of the conclusion of a unit-propagating resolution inference is one greater than the maximum height of its unit premises.) Propagation is then performed exhaustively only up to a propagation depth threshold. A decision literal is then chosen and the threshold is incremented. Such eager decisions guarantee that a decision will eventually be made, even if there is an infinite propagation path. However, eager decisions may also lead to spurious conflicts generating useless conflict-driven clauses. For instance, the clause set (where clauses have been numbered for easier reference) is unsatisfiable, because a conflict with no decisions can be obtained by propagating (by 1), and then , , …, , (by 2, repeatedly), which conflicts with (by 3). But the former propagation has depth . If the propagation depth threshold is lower than , a decision literal is chosen before that conflict is reached. If is chosen, for example, in an attempt to satisfy the sixth clause, there are propagations (using and clauses 1, 4, 5 and 6) with depth lower than the threshold and reaching a conflict that generates the clause , which is useless for showing unsatisfiability of the whole clause set. This is not a serious issue, because useless clauses are often generated in conflicts with non-eager decisions as well. Nevertheless, this example suggests that the starting threshold and the strategy for increasing the threshold have to be chosen wisely, since the performance may be sensitive to this choice.
Interestingly, the problem of non-terminating propagation does not manifest in fragments of first-order logic where infinite unit propagation paths are impossible. A well-known and large fragment is the effectively propositional (a.k.a. Bernays-Schönfinkel) class, consisting of sentences with prenex forms that have an ∃*∀* quantifier prefix and no function symbols. For this fragment, a simpler proof search strategy that only makes decisions when unit propagation terminates, as in the propositional case, suffices. Infinite unit propagation paths do not occur in the effectively propositional fragment because there are no function symbols and hence the term depth does not increase arbitrarily. (The depth of constants and variables is zero, and the depth of a complex term is one greater than the maximum depth of its proper subterms.) Whenever the term depth is bounded, infinite unit propagation paths cannot occur, because there are only finitely many literals with bounded term depth (given the finite set of constant, function and predicate symbols with finite arity occurring in the clause set).
The insight that term depth is important naturally suggests a different approach for the general first-order case: instead of limiting the propagation depth, limit the term depth, allowing arbitrarily long propagations as long as the term depth of the propagated literals is smaller than the current term depth threshold. A literal is propagated only if its term depth is smaller than the threshold. New decisions are chosen when the term-depth-bounded propagation terminates and there are still clauses that are not uniformly satisfied. As before, eager decisions may lead to spurious conflicts, but bounding propagation by term depth seems intuitively more sensible than bounding it by propagation depth.
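A minimal Python sketch of term-depth-bounded propagation follows. It is an illustration under stated assumptions rather than the prover's actual code: the names `term_depth`, `literal_depth`, `successor_rule` and `bounded_propagate` are hypothetical, the depth of a literal is assumed to be the maximum depth of its argument terms, and the only propagation rule is hard-coded for an assumed clause `¬p(X) ∨ p(f(X))` with the initial unit `p(a)`, a typical source of an infinite propagation path.

```python
def term_depth(t):
    """Constants and variables have depth 0; f(t1, ..., tn) has depth 1 + max depth of the ti."""
    if isinstance(t, tuple) and len(t) > 1:
        return 1 + max(term_depth(a) for a in t[1:])
    return 0


def literal_depth(atom):
    """Assumed convention: the depth of a literal is the maximum depth of its arguments."""
    return max((term_depth(a) for a in atom[1:]), default=0)


def successor_rule(unit):
    """Resolving the unit p(t) with the assumed clause (not p(X)) or p(f(X)) yields p(f(t))."""
    if unit[0] == "p":
        return ("p", ("f", unit[1]))
    return None


def bounded_propagate(units, rules, threshold):
    """Propagate exhaustively, keeping only literals whose term depth is below the threshold."""
    derived = list(units)
    agenda = list(units)
    while agenda:
        unit = agenda.pop(0)
        for rule in rules:
            new = rule(unit)
            if new is None or new in derived:
                continue
            if literal_depth(new) < threshold:
                derived.append(new)
                agenda.append(new)
    return derived


derived = bounded_propagate([("p", "a")], [successor_rule], threshold=3)
# derived holds p(a), p(f(a)) and p(f(f(a))); p(f(f(f(a)))) has depth 3 and is not propagated.
```

Without the threshold test the loop would never terminate on this input; with it, propagation stops after finitely many steps and control can return to the decision procedure.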
6 Implementation Details
Scavenger is implemented in Scala, and its source code and usage instructions are available at https://gitlab.com/aossie/Scavenger. Its packrat combinator parsers are able to parse TPTP CNF files [TPTP]. Although Scavenger is a first-order prover, every logical expression is converted to a simply typed lambda expression, implemented by the abstract class E with concrete subclasses Sym, App and Abs for, respectively, symbols, applications and abstractions. A trait Var is used to distinguish variables from other symbols. Scala's case classes are used to make E behave like an algebraic datatype with (pattern-matchable) constructors. The choice of simply typed lambda expressions is motivated by the intention to generalize Scavenger to multi-sorted first-order logic and higher-order logic and to support TPTP TFF and THF in the future. Every clause is internally represented as an immutable two-sided sequent consisting of a set of positive literals (succedent) and a set of negative literals (antecedent).
When a problem is unsatisfiable, Scavenger can output a refutation internally represented as a collection of ProofNode objects, which can be instances of the following immutable classes: UnitPropagatingResolution, Conflict, ConflictDrivenClauseLearning, Axiom, Decision. The first three classes correspond directly to the rules shown in Figure 1. Axiom is used for leaf nodes containing input clauses, and Decision represents a fictive rule holding decision literals. Each class is responsible for checking, typically through require statements, the soundness conditions of its corresponding inference rule. The Axiom, Decision and ConflictDrivenClauseLearning classes are less than 5 lines of code each. Conflict and UnitPropagatingResolution are respectively 15 and 35 lines of code. The code for analyzing conflicts, which traverses the subderivations (conflict graphs) and finds the decisions that contributed to the conflict, is implemented in a superclass and is 17 lines long.
The following three variants of Scavenger were implemented:

EP-Scavenger: Aiming at the effectively propositional fragment, propagation is not bounded, and decisions are made only when propagation terminates.

PD-Scavenger: Propagation is bounded by a propagation depth threshold starting at . Input clauses are assigned depth . Derived clauses and propagated literals obtained while the depth threshold is are assigned depth . The threshold is incremented whenever every input clause that is neither uniformly satisfied nor weakly satisfied is used to derive a new clause or to propagate a new literal. If this is not the case, a decision literal is chosen (and assigned depth ) to uniformly satisfy one of the clauses that is neither uniformly satisfied nor weakly satisfied.

TD-Scavenger: Propagation is bounded by a term depth threshold starting at . When propagation terminates, a stochastic choice is made between selecting a decision literal and incrementing the threshold, with a probability of 50% for each option. Only uniform satisfaction of clauses is checked.
The third and fourth challenges discussed in Section 4 are critical for performance, because they prevent a direct first-order generalization of data structures such as watched literals, which enable efficient detection of clauses that are ready to propagate literals. Without knowing exactly which clauses are ready to propagate, Scavenger (in its three variants) loops through all clauses with the goal of using them for propagation. However, actually trying to use a given clause for propagation is costly. In order to avoid this cost, Scavenger performs two quicker tests. Firstly, it checks whether the clause is uniformly satisfied (by checking whether one of its literals belongs to the model). If it is, then the clause is dismissed. This is an imperfect test, however. Occasionally, some satisfied clauses will not be dismissed, because (in first-order logic) not all satisfied clauses are uniformly satisfied. Secondly, for every literal of every clause, Scavenger keeps a set of decision literals and propagated literals that are unifiable with . A clause is considered quasi-falsified when at most one of its literals has an empty set associated with it. This is a rough analogue of watched literals for detecting quasi-falsified clauses. Again, this is an imperfect test, because (in first-order logic) not all quasi-falsified clauses are ready to propagate. Despite the imperfections of these tests, they do reduce the number of clauses that need to be considered for propagation, and they are quick and simple to implement.
Overall, the three variants of Scavenger listed above have been implemented concisely. Their main classes are only 168, 342 and 176 lines long, respectively, and no attempt has been made to increase efficiency at the expense of the code's readability and pedagogical value. Premature optimization would be inappropriate for a first proof of concept.
Scavenger still has no sophisticated backtracking and restarting mechanism of the kind that propositional SAT solvers have. When Scavenger reaches a conflict, it restarts almost completely: all derived conflict-driven clauses are kept, but the model under construction is reset to the empty model.
7 Experiments
Experiments were conducted in the StarExec cluster [StarExec] to evaluate Scavenger's performance on TPTP v6.4.0 benchmarks in CNF form and without equality (raw experimental data are available at https://doi.org/10.5281/zenodo.293187). For comparison, all other 21 provers available in StarExec's TPTP community and suitable for CNF problems without equality were evaluated as well. For each job pair, the timeouts were 300 CPU seconds and 600 wall-clock seconds.
Table 1: Problems Solved

| Prover | EPR | All |
|---|---|---|
| PEPR-0.0ps | 432 | 432 |
| GrAnDe-1.1 | 447 | 447 |
| Paradox-3.0 | 467 | 506 |
| ZenonModulo-0.4.1 | 315 | 628 |
| TD-Scavenger | 350 | 695 |
| PD-Scavenger | 252 | 782 |
| Geo-III-2016C | 344 | 840 |
| EP-Scavenger | 349 | 891 |
| Metis-2.3 | 404 | 950 |
| Z3-4.4.1 | 507 | 1027 |
| Zipperpin-FOF-0.4 | 400 | 1029 |
| Otter-3.3 | 362 | 1068 |
| Bliksem-1.12 | 424 | 1107 |
| SOS-2.0 | 351 | 1129 |
| CVC4-FOF-1.5.1 | 452 | 1145 |
| SNARK-20120808 | 417 | 1150 |
| Beagle-0.9.47 | 402 | 1153 |
| E-Darwin-1.5 | 453 | 1213 |
| Prover9-1109a | 403 | 1293 |
| Darwin-1.4.5 | 508 | 1357 |
| iProver-2.5 | 551 | 1437 |
| ET-0.2 | 486 | 1455 |
| E-2.0 | 489 | 1464 |
| Vampire-4.1 | 540 | 1524 |
Table 1 shows how many of the 1606 unsatisfiable CNF problems and 572 effectively propositional (EPR) unsatisfiable CNF problems each theorem prover solved, and Figures 2 and 3 show the performance in more detail. For a first implementation, the best variants of Scavenger show an acceptable performance. All variants of Scavenger outperformed PEPR, GrAnDe, DarwinFM, Paradox and ZenonModulo, and EP-Scavenger additionally outperformed Geo-III. On the effectively propositional problems, TD-Scavenger outperformed LEO-II, ZenonModulo and Geo-III, and solved only 1 problem less than SOS-2.0 and 12 less than Otter-3.3. Although Otter-3.3 has long ceased to be a state-of-the-art prover and has been replaced by Prover9, the fact that Scavenger solves almost as many problems as Otter-3.3 is encouraging, because Otter-3.3 is a mature prover with 15 years of development, implementing (in the C language) several refinements of proof search for resolution and paramodulation (e.g. orderings, set of support, splitting, demodulation, subsumption) [Otter, OtterManual], whereas Scavenger is an as yet unrefined and concise implementation (in Scala) of a comparatively straightforward search strategy for proofs in the Conflict Resolution calculus, completed in slightly more than 3 months. Conceptually, Geo-III (based on Geometric Resolution) and Darwin (based on Model Evolution) are the provers most similar to Scavenger. While Scavenger already outperforms Geo-III, it is still far from Darwin. This is most probably due to Scavenger's current eagerness to restart after every conflict, whereas Darwin backtracks more carefully (cf. Sections 6 and 8). Scavenger and Darwin also treat variables in decision literals differently. Consequently, Scavenger detects more (and non-ground) conflicts, but learning conflict-driven clauses can be more expensive, because unifiers must be collected from the conflict graph and composed.
EP-Scavenger solved 28.2% more problems than TD-Scavenger and 13.9% more than PD-Scavenger. This suggests that non-termination of unit propagation is an uncommon issue in practice: EP-Scavenger is still able to solve many problems, even though it does not bound propagation, whereas the other two variants solve fewer problems because of the overhead of bounding propagation even when it is not necessary. Nevertheless, there were 28 problems solved only by PD-Scavenger and 26 problems solved only by TD-Scavenger (among Scavenger's variants). EP-Scavenger and PD-Scavenger can solve 9 problems with TPTP difficulty rating 0.5, all from the SYN and FLD domains. Three of these nine problems were solved in less than 10 seconds.
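The two percentages quoted above follow directly from the "All" column of Table 1, as the following short Python check shows. The dictionary keys and the helper name `percent_more` are chosen here for illustration only.

```python
# Problems solved by each variant, taken from the "All" column of Table 1.
solved = {"EP": 891, "PD": 782, "TD": 695}


def percent_more(a: int, b: int) -> float:
    """How many percent more problems were solved with count a than with count b."""
    return round(100.0 * (a - b) / b, 1)


ep_vs_td = percent_more(solved["EP"], solved["TD"])  # (891 - 695) / 695
ep_vs_pd = percent_more(solved["EP"], solved["PD"])  # (891 - 782) / 782
print(ep_vs_td, ep_vs_pd)
```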
8 Conclusions and Future Work
Scavenger is the first theorem prover based on the new Conflict Resolution calculus. The experiments show a promising, albeit not yet competitive, performance.
A comparison of the performance of the three variants of Scavenger shows that it is non-trivial to interleave decisions within possibly non-terminating unit propagations, and further research is needed to determine (possibly in a problem-dependent way) optimal initial depth thresholds and threshold incrementation strategies. Alternatively, entirely different criteria could be explored for deciding to make an eager decision before propagation is over. For instance, decisions could be made if a fixed or dynamically adjusted amount of time elapses.
The performance bottleneck that needs to be most urgently addressed in future work is backtracking and restarting. Currently, all variants of Scavenger restart after every conflict, keeping derived conflict-driven clauses but throwing away the model constructed so far. They must reconstruct models from scratch after every conflict. This requires a lot of repeated recomputation, and therefore a significant performance boost could be expected from a more sensible backtracking strategy. Scavenger's current naive unification algorithm could be improved with term indexing [Indexing], and there might also be room to improve Scavenger's rough first-order analogue of the watched literals data structure, even though the first-order challenges make it unlikely that something as good as the propositional watched literals data structure could ever be developed. Further experimentation is also needed to find optimal values for the parameters used in Scavenger for governing the initial thresholds and their incrementation policies.
Scavenger's already acceptable performance, despite the possible implementation improvements just discussed, indicates that automated theorem proving based on the Conflict Resolution calculus is feasible. However, much work remains to be done to determine whether this approach will eventually become competitive with today's fastest provers.
Acknowledgments:
We thank Ezequiel Postan for his implementation of TPTP parsers for Skeptik [Skeptik], which we have reused in Scavenger. We are grateful to Albert A. V. Giegerich, Aaron Stump and Geoff Sutcliffe for all their help in setting up our experiments in StarExec. This research was partially funded by the Australian Government through the Australian Research Council and by the Google Summer of Code 2016 program. Daniyar Itegulov was financially supported by the Russian Scientific Foundation (grant 151400066).