A Deep Reinforcement Learning based Approach to Learning Transferable Proof Guidance Strategies

11/05/2019 ∙ by Maxwell Crouse, et al. ∙ 38

Traditional first-order logic (FOL) reasoning systems usually rely on manual heuristics for proof guidance. We propose TRAIL: a system that learns to perform proof guidance using reinforcement learning. A key design principle of our system is that it is general enough to allow transfer to problems in different domains that do not share the same vocabulary of the training set. To do so, we developed a novel representation of the internal state of a prover in terms of clauses and inference actions, and a novel neural-based attention mechanism to learn interactions between clauses. We demonstrate that this approach enables the system to generalize from training to test data across domains with different vocabularies, suggesting that the neural architecture in TRAIL is well suited for representing and processing of logical formalisms.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Automated theorem provers (ATPs) have established themselves as useful tools for solving problems that are expressible in a variety of knowledge representation formalisms (e.g. first-order logic). Such problems are commonplace in domains like software design and verification, where ATPs are used to prove that a system satisfies some formal design specification.

Unfortunately, while the formalisms that underlie such problems are (more or less) fixed, the strategies needed to solve them have been anything but. With each new domain added to the purview of automated theorem proving, there has been a need for the development of new heuristics and strategies that restrict or order how the ATP searches for a proof. The process of guiding a theorem prover during proof search with such heuristics or techniques is referred to as proof guidance.

Many state-of-the-art ATPs use machine learning to automatically determine heuristics for assisting with proof guidance. Generally, the features considered by such learned heuristics would be manually designed

[KUV-IJCAI15-features, JU-CICM17-enigma]

, though more recently they have been learned through deep-learning

[LoosISK-LPAR17-atp-nn, ChvaJaSU-CoRR19-enigma-ng, JU-CoRR19-enigma-hammering, paliwal2019graph]. For instance, recent work by [LoosISK-LPAR17-atp-nn] demonstrated that automated systems could outperform a state-of-the-art theorem prover when deep-learning was used for guiding the initial stages of proof search (before switching to proof guidance with traditional methods). In a similar vein, [ChvaJaSU-CoRR19-enigma-ng, JU-CoRR19-enigma-hammering] introduced theorem proving approaches based purely on deep-learning and showcased them in a bootstrapped evaluation setting where they iteratively trained and tested their methods on a single shared set of problems (with the objective being to maximize the total number of problems solved).

More recent approaches have used reinforcement learning (RL), where the system automatically learns the right proof guidance strategy from proof attempts. RL has been deployed in logics less expressive than first-order logic (FOL) [KuYaSa-CoRR18-intuit-pl, LeRaSe-CoRR18-heuristics-atp-rl, CheTi-CoRR18-rew-rl], as well as FOL [KalUMO-NeurIPS18-atp-rl, ZoCMiKU-ATP19-atp-rl-curric], where it has been used with customized reasoners (e.g., a tableau based reasoner [KalUMO-NeurIPS18-atp-rl]). However, to the best of our knowledge, there has not yet been any attempt to build a RL based proof guidance system that is capable of showing transfer across different problem domains within a logical formalism.

In this paper, we describe our approach to build TRAIL, an RL based proof guidance system which can transfer across different problem domains within a logical formalism. At the heart of TRAIL is (a) a novel representation of clauses, which abstracts away the specifics of the vocabulary of a problem, (b) a novel representation of the entire state of a theorem prover, which allows the system to consider the problem as a whole. This approach forces the system to focus less on individual patterns and symbols in a clause, and more on interactions between clause representations, which helps the system generalize across problem domains. To our knowledge, we are the first to design a deep reinforcement based proof guidance system with the explicit goal of building abstractions that suit transfer across problem domains.

We also evaluate multiple training regimens for transfer to determine which mechanism provides the best generalization: (a) we randomly explore from a tabula rasa state as done in AlphaZero [alphazero] to build a proof guidance system that maximizes exploration, (b) we explore the effectiveness of bootstrapping the system based on proofs from existing reasoners in a specific problem domain, where the system can effectively learn from highly optimized reasoners, and (c) we combine the effects of imitation with exploration. We examine how well the three regimens provide transfer across problem domains.

In summary, the core contributions of our work are as follows: (a) We propose a novel deep reinforcement based proof guidance system for FOL, which is general enough to allow transfer to problems in different domains. (b) We demonstrate the efficacy of this system, by showing that it is just as effective as a system which is trained on a problem set where all commonality in vocabulary between problems is removed. (c) We show significant generalization across different problem domains, from Mizar to TPTP and vice versa (after ensuring no overlap between the two data sets in terms of problems and vocabularies111No overlap between TPTP and Mizar is achieved by (1) removing common problems, and (2) by consistent renaming, within each problem, all predicates, functions, and constants), when compared to training and testing on the same domain. (d) We ensure that the performance of our neural based system is competitive with saturation based FOL reasoners, suggesting that the generalized learning shown by TRAIL approaches that of manually optimized reasoners.

2 Background: Reasoning in FOL

We assume the reader has knowledge of basic first-order logic and automated theorem proving terminology and thus will only briefly describe the terms commonly seen throughout this paper. For readers interested in learning more about logical formalisms and techniques see [thelogicbook, enderton2001mathematical].

In this work, we focus on first-order logic (FOL) with equality. In the standard FOL problem-solving setting, an ATP is given a conjecture (i.e., a formula to be proved true or false), axioms (i.e., formulas known to be true), and inference rules (i.e., rules that, based on given formulas, allow for the derivation of new true formulas). From these inputs, the ATP performs a proof search, which can be characterized as the successive application of inference rules to axioms and derived formulas until a sequence of derived formulas is found that represents a proof of the given conjecture.

The types of formulas considered here are clauses, i.e. disjunctions of literals (where a literal is a (un)negated formula that otherwise has no inner logical connectives). We further specify all variables to be implicitly universally quantified.

The theorem prover compared against in this work, Beagle [Beagle2015], is saturation based. A saturation based theorem prover maintains two sets of clauses, referred to as the processed and unprocessed sets of clauses. These two sets correspond to the clauses that have and have not been yet selected for inference. The actions that saturation based theorem provers take are referred to as inferences. Inferences involve an inference rule (e.g. resolution, factoring) and a non-empty set of clauses, considered to be the premises of the rule. At each step in proof search, the ATP selects an inference with premises in the unprocessed set (some premises may be part of the processed set) and executes it. This generates a new set of clauses, each of which is added to the unprocessed set. The clauses in the premises that are members of the unprocessed set are then added to the processed set. This iteration continues until a clause is generated (in most cases, the empty clause) that signals a proof has been found.

3 Trail

We first describe our overall approach to defining the proof guidance problem in terms of reinforcement learning. We then describe two novel aspects of TRAIL that help with generalized learning of proof guidance: (a) a vectorization process which represents all clauses and actions within a proof state in a way that abstracts away the specifics of the vocabulary of a problem and (b) an attention-based policy network that learns the interactions between clauses and actions to select the next action. Last, we describe the three different learning regimens we used to see if generalization varies as a function of training procedure.

3.1 RL based Proof Guidance

Figure 1: Formulation of automated theorem proving as a RL problem.

We formalize the guidance process as an RL problem where the reasoner provides the environment in which the learning agent operates. Figure 1 shows how an ATP problem is solved in our framework. Given a conjecture and a set of axioms, TRAIL iteratively performs reasoning steps until a proof is found (within a provided time limit). The reasoner tracks the state of the proof, , which encapsulates both the clauses that have been derived or used in the derivation so far and the actions that can be taken by the reasoner at the current step. At each step, this state is passed to the learning agent: an attention-based model [luong2015attention] that predicts a distribution over the actions and uses it to sample a corresponding action, . This action is then given to the reasoner, which executes it and updates the proof state.

Formally, a state, , consists of:

  • , the set of processed clauses, (i.e., all clauses selected by the agent up to step ); where .

  • , the set of all available actions that the reasoner could execute at step . An action, , is a pair comprising an inference rule, , and a clause, . is an axiom, the negated conjecture, or a derived clause that has yet to be selected by the agent. Informally, represents the unprocessed clauses at step and the inference rules applicable to them. is the cross product of the set of all inference rules (denoted by ) and the set containing all axioms and the negated conjecture.

At step , given a state

(provided by the reasoner), the learning agent computes a probability distribution over the set of available actions

, denoted by (where is the set of parameters for the learning agent), and samples an action . The sampled action is then executed by the reasoner by applying to (which may involve other clauses from the processed set ). This yields a set of new derived clauses, , and a new state, , where and .

Upon completion of a proof attempt, TRAIL must compute a loss and issue a reward that encourages the agent to optimize for decisions leading to a successful proof in the smallest number of steps. Specifically, for an unsuccessful proof attempt (i.e., the underlying reasoner fails to derive a contradiction within the time limit), each step in the attempt is assigned a reward . For a successful proof attempt, in the final step, the underlying reasoner produces a parsimonious refutation proof containing only the actions that generated derived facts directly or indirectly involved in the final contradiction. At step of a successful proof attempt where the action is selected, the reward is if is not part of this minimal refutation proof ; otherwise is inversely proportional to the total number of steps in the proof attempt.

The final loss consists of the standard policy gradient loss [sutton1998reinforcement] and an entropy regularization term to avoid collapse onto a sub-optimal deterministic policy and to promote exploration.


where indicates the action selected at step (i.e., if the action is selected at ; otherwise ), and is the entropy regularization hyper-parameter.

We use a normalized reward to improve stability of training as the intrinsic difficulty of problems can vary widely in our problem dataset. We explored (i) normalization by the inverse of the number of steps performed by a mature traditional reasoner (in this work, Beagle), (ii) normalization by the best reward obtained in repeated attempts to solve the same problem, and (iii) no normalization; the normalization strategy was a hyper-parameter. This loss has the effect of giving actions that contributed to the most direct proofs for a given problem higher rewards, while dampening actions that contributed to lengthier proofs for the same problem.

Figure 2: Overview of the vectorizer operating on the clause .

3.2 Vectorization Process

Clause Vectorization: Figure 2 shows an example of the vectorizer operating on a clause with a single positive literal. Our method for vectorization is informed by how inference rules operate over clauses. Consider the resolution inference rule. Two clauses resolve if one contains a positive literal whose constituent atom is unifiable with the constituent atom of a negative literal in the other clause. Hence, vector representations of clauses should capture the relationship between literals and their negations as well as reflect structural similarities between literals that are indicative of unifiability.

Our approach captures these features by deconstructing input clauses into sets of patterns. We define a pattern to be a linear chain that begins from a predicate symbol and includes one argument (and its argument position) at each depth until it ends at a constant or variable. The set of all patterns for a given clause is then simply the set of all linear paths between each predicate and the constants and variables they bottom out with. Since the names of variables are arbitrary, they are replaced with a wild-card symbol, “”, indicating that the element may match with anything. Argument position is also indicated with the use of wild-card symbols. Going back to the clause in Figure 2, we obtain the patterns and ).

To ensure that our learned features are not overfit to the specific vocabulary of any particular problem, our approach systematically renames patterns prior to vectorization by appending a unique identifier (generated from the proof attempt number) to each predicate, function, and constant (e.g. for the second proof attempt our patterns would become and ). This change does not influence the underlying semantics of a problem, as every predicate, function, and constant has been renamed consistently. The motivation behind this change is to force our approach to learn higher level, more transferable proof search strategies that depend on semantics and structures rather than on individual symbols.

We obtain a -dimensional representation of a clause, , by hashing the linearization of each pattern using MD5 hashes [rivest1992md5] to compute a hash value , and setting the element at index to the number of occurrences of the pattern in the clause . Further, we explicitly encode the difference between patterns and their negations by doubling the representation size and hashing them separately, so that the first elements encode the positive patterns and the second

elements encode the negated patterns. This hashing approach greatly condenses the representation size compared to a one-hot encoding of each pattern.

Action Vectorization: Since actions are pairs of clauses and inference rules, our approach represents the clause in each action pair using the process described above and represents the inference rule as a one-hot encoding of size . We write to denote the encoding for action .

3.3 Attention Based Policy Network

The architecture of the policy network in TRAIL is displayed in Figure 3. The input to the policy network is the set of processed clause representations, , and the set of action representations, . First, we transform the sparse clause representations (from the set of processed clauses or actions) into dense representations by passing them through a set of fully-connected layers. This yields sets and of dense representations for the processed and action clauses, respectively. Then, for each action pair, we concatenate the clause representation with the corresponding inference representation to form the new action representation . The resulting sets of new clause and action representations are joined into matrices and , respectively.

Throughout the reasoning process, the policy network must produce a distribution over the actions relative to the clauses that have been selected up to the current step, where both the actions and clauses are sets of variable length. This setting is analogous to the ones seen in attention-based approaches to problems like neural machine translation 

[luong2015attention, vaswani2017attention] and video captioning [yu2016video, whitehead2018kavd], in which the model must score each encoder state with respect to a decoder state or the other encoder states. To score each action relative to each clause, we compute a multiplicative attention [luong2015attention] as


where is a learned parameter and the resulting matrix,

, is a heat map of interaction scores between processed clauses and available actions. We then perform max pooling over the columns (i.e., clauses) of

to find a single score for each action and apply a softmax normalization to the pooled scores to obtain the distribution over the actions, .

Figure 3: Policy network architecture.

3.4 Training Regimens

We use one of the following three strategies to attempt to solve all the problems in the training set in order to determine if training regimens also determine how effective generalization is.

  • Random Exploration: We randomly explore the search space as done in AlphaZero [alphazero] to establish performance when the system is started from a tabula rasa state (i.e., a randomly initialized policy network ). At training, at an early step (i.e., , where , the temperature threshold, is a hyper-parameter that indicates the depth in the reasoning process at which training exploration stops), we sample the action in the set of available actions , according to the following probability distribution derived from the policy network :

    where , the temperature, is a hyper-parameter that controls the exploration-exploitation trade-off and decays over the iterations (a higher temperature promotes more exploration). On the other hand, when the number of steps already performed is above the temperature threshold (i.e., ), an action, , with the highest probability from the policy network, is selected: . At the end of training iteration , the newly collected examples and those collected in the previous iterations (

    is the example buffer hyper-parameter) are used to train, in a supervised manner, the policy network using the reward structure and the loss function defined in the approach overview section. The updated policy network is retained for the next iteration if it is superior to the previous one in terms of number of problems solved on the validation problem set; otherwise, it is discarded. At validation and testing, exploration is disabled (i.e., the temperature threshold is set to 0).

    This approach has the disadvantage that the system spends a significant amount of time in unproductive parts of the search space, but it may help transfer because of the random exploration of search space.

  • Expert Bootstrapping Learning: We explore the effectiveness of bootstrapping the RL process. This approach is similar to random exploration with the exception that, at the first iteration, the initial randomly initialized policy network model is trained, in a supervised manner, using examples from problems from the training set solved by an existing reasoner (in this work we use Beagle [Beagle2015]). Thus, the first iteration ends with a model trained by an expert, then training proceeds exactly as in the first random exploration approach. We contrast this approach versus the random exploration strategy for generalization.

  • Exploratory Imitation Learning:

    Similar to expert bootstrapping, we bootstrap the training with examples from an existing reasoner (our expert). But, in later iterations, we reduce our reliance on this reasoner for example collection. Specifically, at iteration at training, for a step in the reasoning process, with a probability , we delegate the selection of the action from the list of available actions to the expert, and, with a probability , we follow the same action selection strategy as in the random exploration approach. Here is a hyper-parameter controlling the decay of our reliance on the expert reasoner. At validation and testing, there is no reliance on the expert. This approach is the middle ground between random exploration and expert based learning, and is hence a useful datapoint in understanding the effectiveness of either on generalization.

4 Experiments and Results

In this section, we are trying to answer the following questions: (a) How well does the proof guidance system in TRAIL generalize across problem domains? To evaluate this question, we trained on two different datasets, and then measured generalization across them. (b) What factors contribute to generalization? To evaluate this question, we examined if differences in training regimens in RL, or clausal representations affect generalization. (c) Is TRAIL effective at generalized proof guidance? As a sanity check, we tested whether TRAIL’s performance is competitive with existing implementations.

4.1 Datasets and Experimental Setup

We evaluated TRAIL using problems drawn from both the Mizar222https://github.com/JUrban/deepmath/ [mizar] dataset and the Thousands of Problems for Theorem Provers (TPTP)333http://tptp.cs.miami.edu/ dataset. Mizar is a well known mathematical library of formalized and mechanically-verified mathematical problems. TPTP is the definitive benchmarking library for theorem provers, designed to test ATP performance across a wide range of problem domains. Problems from Mizar were drawn from the subset used by [Alemi-NIPS16-deepmath], i.e. the subset of problems known to be solvable by existing ATPs (this subset was used to allow for a direct comparison against our baseline reasoner). From the TPTP dataset, a random subset of 2,000 (TPTP-2k) problems were selected from various problem domains. For transfer experiments, we ensure no overlap between the two datasets in terms of problems and vocabularies by (a) removing common problems from the test sets, and (b) by consistent renaming, within each problem, all predicates, functions, and constants.

We used gradient-boosted tree search from scikit-optimize

444https://scikit-optimize.github.io/ to find effective hyper-parameters using 10% of the Mizar dataset. This resulted in the hyper-parameter values shown on Table 1. We then selected a different 10% subset of the dataset (completely disjoint from the one used for hyper-parameter tuning), performed a 3-fold cross evaluation on it, and, for each iteration, we report the average across the combined set of problems in all folds. We set the maximum time limit for solving a problem to 100 seconds. Experiments were conducted over a cluster of 19 CPU machines (56 x 2.0 GHz cores & 247 GB RAM) and 10 GPU machines (2 x P100 GPU, 16 x 2.0 GHz CPU cores, & 120 GB RAM) over 4 to 5 days (for hyper-parameter tuning, we added 5 CPU and 2 GPU machines).

We use two metrics (i) completion ratio: the proportion of problems solved within the specified time limit, (ii) average proof length: average number of steps taken to find a proof.

layers units per layer dropout (reg.) (sparse vector size) (temp.) (temp. threshold) (expert decay) reward normalization
2 161 0.57 0.004 645 1.13 11 0.75 (iii) No normalization
Table 1: Hyper-parameter values
Training Regimens Testing
Mizar TPTP
Random model (no training) 29.5 8.6
Training on
Tabula Rasa 57.1 26.3
Expert Bootstrapping 57.2 26.3
Exploratory Imitation 56.8 24.9
Training on
Tabula Rasa 51.3 22.7
Expert Bootstrapping 53.2 24.9
Exploratory Imitation 53.0 25.4
Table 2: Percentage of problems solved by each training regimens and how well it transfer to other domains.

4.2 Generalization Across Problem Domains

In this experiment, we show TRAIL’s performance when we train TRAIL on Mizar and test its performance on TPTP and vice versa. As described earlier, these two datasets have very different vocabularies. In examining the factors that help this sort of generalization across problem domains, we examine the effect of training regimens, as well as the efficacy of the clausal representations.

Training on Different Regimens: Table 2 shows the performance of the different training regimens and how much each regimen facilitates generalization. As a baseline, we also include the performance of a model with randomly initialized weights, without any training. This randomly initialized model could solve 29.5% of Mizar problems and 8.6% of TPTP problems. We can see first that TRAIL when tested and trained on the same domain performed well compared to the randomly initialized models (a statistical test of for Mizar, for TPTP), suggesting that learning did occur. Notice that TRAIL solved 57.03% of the Mizar test problems when trained on the Mizar dataset across regimens, and slightly less on Mizar when trained on TPTP at 52.5% (a statistical test of suggested statistical significance). Still this model was statistically significantly better than random (). For TPTP, training on TPTP (24.33% of TPTP test problems solved) was statistically just as good as training on Mizar and testing on TPTP (25.83% of TPTP test problems solved). In general, training regimens had no effect on success rates for either Mizar or TPTP. A similar result emerged for proof length, with the exception of a statistically significant advantage for exploratory imitation learning (63.67) compared to tabula rasa (67.75) or expert bootstrapping (69.07), using as the statistical test (). In all other cases such as transfer from TPTP to Mizar, or vice versa or training on TPTP, these differences were not statistically significant, so we do not describe these results further. We therefore conclude that TRAIL provides very good transfer across different problem domains, regardless of type of regimen.

Renaming No Renaming
Model trained on Mizar 57.1 26.3 57.3 24.5
Model trained on TPTP 51.3 22.2 48.1 21.4
Table 3: Performance (completion ratio) of TRAIL (Tabula Rasa) at testing with and without patterns renaming.

The Role of Vector Representation: Recall that our vectorizer systematically renames patterns prior to vectorization to make sure it will not overfit to the specific vocabulary of any particular problem (see Section 3.2). In this experiment, we test the effect of pattern renaming on the performance of TRAIL (trained from a tabula rasa state) and its effect on generalization. Table 3 shows the testing performance of TRAIL on Mizar and TPTP datasets with and without renaming. From the table it can be seen that disabling pattern renaming did not influence the system performance nor its generalization to TPTP or Mizar in a statistically meaningful way555A test for difference between renaming and no-renaming showed no differernce. These results suggest that the abstraction performed by indexing abstract patterns in clauses and the attention-based policy network are sufficient to allow generalization across problems, and we do not need renaming to get transfer.

4.3 Effectiveness of Trail

Baseline Comparisons: Beagle [Beagle2015]

is a well-established reasoner that provides competitive performance on ATP datasets. The current implementation of TRAIL uses Beagle as its underlying reasoner. This is purely an implementation choice, made mostly because Beagle is open source and could have its proof guidance removed and replaced with TRAIL. The purpose of Beagle in TRAIL is purely to execute the actions selected by the TRAIL learning agent; i.e., Beagle’s proof guidance was completely disabled when it was embedded as a component in TRAIL. TRAIL is not reasoner-dependent, and any reasoner that can apply FOL inference rules can serve the same role as Beagle in TRAIL. We support this claim with an experiment in the following section that shows similar performance gains when Beagle is substituted with a different (in-house) reasoner.

The purpose of the baseline experiments are to ensure that proof guidance controlled by TRAIL functions at a competitive level with a manually-optimized reasoner. We therefore compare TRAIL with (1) Beagle with all its own optimizations enabled for proof guidance, and (2) performance cited by  [KalUMO-NeurIPS18-atp-rl]; which is the system most similar to ours (an RL based ATP system).

Beagle (Fully Optimized) 61.0 32.4
Trail 57.0 24.3
Table 4: Percentage of problems solved on Mizar and TPTP datasets at testing.

Table 4 shows the percentage of problems solved by TRAIL compared to Beagle [Beagle2015]. TRAIL’s performance in Table 4 is the average of the three regimens. Beagle’s optimized strategy (heuristics-based) managed to solve 61% of the Mizar problems and 32.4% of TPTP problems. We note that the performance reported by TRAIL is at least comparable for a very recent RL-based ATP system [KalUMO-NeurIPS18-atp-rl]. Although a direct comparison with TRAIL cannot be made due to different time limits and hardware,  [KalUMO-NeurIPS18-atp-rl] reported 90% problems solved by Vampire in the same setting for which they got 50%, suggesting that the baseline performance of TRAIL is good.

Figure 4 shows how the average number of steps required to find a proof improves over iterations. For each problem, this score is calculated as the number of proof steps by Beagle divided by the number of steps used by TRAIL to find the proof. This score is computed on problems solved by both systems. A score greater than one means TRAIL finds more efficient and shorter proofs compared to Beagle. All training regimens started from the same initial model weights; hence they have the same performance at iteration 1. After the first model update, the performance of all models improved significantly. These scores continued to improve over the iterations, almost matching Beagle for tabula rasa and slightly above Beagle for exploratory imitation at the last iteration.

Figure 4:

Validation: Average number of steps required to find a proof relative to Beagle on Mizar (with standard error bars).

Reasoner Agnosticism: TRAIL is a reasoner-agnostic system; i.e., one can use any reasoner after disabling the reasoner’s own proof guidance strategy, as long as this reasoner can execute the actions proposed by TRAIL successfully. To demonstrate that this is indeed the case, we also integrated a baseline reasoner (Basic) in TRAIL. Basic is an in-house reasoner that implements some of the more common state-of-the-art optimization techniques such as subsumption checking, demodulation, and term indexing. The fully optimized version of Basic could solve 13.6% of Mizar problems. TRAIL integrated with Basic could solve only 3% of these problems prior to any training, however it improved to 15.6% after 30 iterations; a 2% improvement over Basic’s fully optimized version. Our goal here was only to demonstrate the generality of the TRAIL system architecture, and to show that the results of prior sections were not due to the choice of a particular reasoner like Beagle.

5 Related Work

A coarse overview of learning approaches for ATPs was given in the introduction. Several approaches focus on the (sub)problem of premise selection (i.e., finding the axioms relevant for the proof of the considered problem) [AlamaHKTU-jar14-premises-corpus-kernel, Blanchette-jar16-premises-isabelle-hol, Alemi-NIPS16-deepmath, WangTWD-NIPS17-deepgraph]. As is the case with automated theorem proving, the majority of approaches are based on manual heuristics and traditional machine learning; a few recent approaches are neural [Alemi-NIPS16-deepmath, WangTWD-NIPS17-deepgraph]. Since humans still often outperform fully automated systems, there has also been research on using learning to support interactive theorem proving [Blanchette-jar16-premises-isabelle-hol, Bancerek-JAR2018-mml].

Some early research has applied (deep) RL for guiding inference [TaylorMSW-FLAIRS07-cyc-rl], planning, and machine learning techniques for inference in relational domains [surveyRLRD]. Several papers consider propositional logic or other decidable FOL fragments and thus focus on much simpler algorithms than we do. Close to TRAIL is the approach described in [KalUMO-NeurIPS18-atp-rl]. It applies RL combined with Monte-Carlo tree search (MCTS) for automated theorem proving in FOL, but has some key limitations: 1) The input axioms are represented by features that depend on the vocabulary (i.e., user-defined predicates etc.). As a result, the approach does not generalize well to new problems with a different vocabulary. 2) The approach is specific to tableau-based reasoners and therefore may present difficulties for theories containing many equality axioms, which are better handled in the superposition calculus [bachmair1994refutational]. 3) It relies upon simple linear learners and gradient boosting as policy and value predictors.

Non-RL based approaches using deep-learning to guide proof search include [JU-CoRR19-enigma-hammering, ChvaJaSU-CoRR19-enigma-ng, LoosISK-LPAR17-atp-nn, paliwal2019graph]

. Each of the listed works would use a neural network during proof guidance to rank the list of available clauses with respect to

only the conjecture. Their underlying theorem prover would then expand proof search from the highest ranked clause. This ranking scheme was recognized as a limitation in [piotrowski2019guiding], where the authors described how such a methodology would fail to capture any dependencies between non-conjecture clauses. Their proposed solution was an RNN based encoding scheme for embedding entire proof branches in a tableau-based reasoner. The choice of a tableau reasoner was due to the relative compactness of tableau proof branches, which helped to keep the RNN from incorrectly discarding information. It was unclear how to extend their method to saturation-based theorem provers, where a theorem prover state may include thousands of irrelevant clauses.

Our work also aligns well with the recent proposal of an API for deep RL based interactive theorem proving in HOL Light, using imitation learning from human proofs [BaLoRSWi-CoRR19-holist]. That paper also describes an ATP as a proof-of-concept. However, their ATP is intended as a baseline and lacks more advanced features like our exploratory learning.

Our work is also inspired by AlphaZero [alphazero] and prior work on games and tree search [thinkingfastandslow, alphazero, Silver-nature16-go, Silver-nature17-go] where we randomly explore the search space as AlphaZero (but without MCTS) for the tabula rasa training.

6 Conclusions and Future Work

We presented TRAIL: a flexible, deep reinforcement learning based proof guidance strategy that transfers well across FOL problem domains. The transfer was robust across training regimens and changes in the vocabulary of the problems. A next logical step is to see if training transfers across logical formalisms.