Towards Proof Synthesis Guided by Neural Machine Translation for Intuitionistic Propositional Logic

06/20/2017 ∙ by Taro Sekiyama, et al.

Inspired by the recent evolution of deep neural networks (DNNs) in machine learning, we explore their application to PL-related topics. This paper is the first step towards this goal; we propose a proof-synthesis method for the negation-free propositional logic in which we use a DNN to obtain a guide for proof search. The idea is to view the proof-synthesis problem as a translation from a proposition to its proof. We train seq2seq, which is a popular network in neural machine translation, so that it generates a proof encoded as a λ-term of a given proposition. We implement the whole framework and empirically observe that a generated proof term is close to a correct proof in terms of the tree edit distance of ASTs. This observation justifies using the output from a trained seq2seq model as a guide for proof search.




1 Introduction

Deep neural networks (DNNs) have seen great success and have become one of the most popular technologies in machine learning. They are especially good at solving problems in which one needs to discover certain patterns in problem instances (e.g., image classification [13, 26, 30], image generation [12, 9], and speech recognition [8, 11]).

Compared to the huge success in these problems, their application to PL-related problems such as program synthesis and automated theorem proving is, in spite of recent progress [5, 32, 10, 2, 20], yet to be fully explored. This is partly because of the following gaps between the PL-related areas and the areas where DNNs are competent:

  • The output of a DNN is not guaranteed to be correct; its performance is often measured by the ratio of the correct responses with respect to the set of test data. However, at least in the traditional formulation of PL-related problems, the answer is required to be fully correct.

  • It is nontrivial how to encode an instance of a PL-related problem as an input to a DNN. For example, program synthesis is the problem of generating a program from its specification. Although both programs and specifications are typically represented as abstract syntax trees, feeding a tree to a DNN requires nontrivial encoding [20].

This paper reports our preliminary experimental results on the application of DNNs to a PL-related problem: proof synthesis for the intuitionistic propositional logic. Proof synthesis helps solve problems of automated theorem proving, which is one of the most fundamental problems in the theory of PL and has a long history in computer science. Automated theorem proving is also an important tool for program verification, where the correctness of a program is reduced to the validity of verification conditions. It is also of interest for program synthesis because automated proof synthesis can be seen as automated program synthesis via the Curry–Howard isomorphism [27].

Concretely, we propose a procedure that solves the following problem:


Input: A proposition of the propositional logic represented as an AST. (Footnote 1: Currently, we train and test the model with propositions of the negation-free fragment of this logic.)

Output: A proof term of the proposition, represented as an AST of the simply typed λ-calculus extended with pairs and sums.

One of the main purposes of the present work is to measure the baseline of proof synthesis with DNNs. The present paper examines how well a “vanilla” DNN framework performs on our problem. As we describe below, we observed that such an untuned DNN indeed works quite well.

In order to apply DNNs to our problem as easily as possible, we take the following (admittedly simple-minded) view: proof synthesis can be viewed as translation from a proposition to a proof term. Therefore, we should be able to apply neural machine translation, machine translation that uses a deep neural network inside, to the proof-synthesis problem, just by expressing both a proposition and a proof as sequences of tokens.

We adopt a sequence-to-sequence (seq2seq) model [6, 28, 1], which achieves good performance in English–French translation [28], for the proposition–proof translation and train it on a set of proposition–proof pairs. Although the trained model generates correct proofs for many propositions (see Table 3 in Section 5; the best model generates correct proofs for almost half of the benchmark problems), it sometimes generates (1) a grammatically incorrect token sequence or (2) an ill-typed response. As a remedy to these incorrect responses, our procedure postprocesses the response to obtain a correct proof term.

Figure 1: An overview of the proof-synthesis system.

Figure 1 overviews our proof-synthesis procedure. We explain the important components:

  • The core of our proof-synthesis method is a neural network that takes the token-sequence representation of a proposition as an input. The network is trained to generate a proof term of the given proposition; therefore, its output is expected to be “close” to the token-sequence representation of a correct proof term of that proposition.

  • The generated token sequence may be grammatically incorrect. In order to compute a parse tree from such an incorrect sequence, we apply Myers’ algorithm [21], which produces a grammatically correct token sequence that is closest to the generated sequence in terms of the edit distance.

  • Using the obtained parse tree as a guide, our procedure searches for a correct proof of the proposition. To this end, we enumerate parse trees in the ascending order of the tree edit distance proposed by Zhang et al. [33]. In order to check whether an enumerated tree is a correct proof term of the proposition, we pass it to a proof checker. In the current implementation, we translate the tree to a Haskell program and typecheck it using the Haskell interpreter GHCi. If it typechecks and has the type that corresponds to the proposition, then, via the Curry–Howard isomorphism, we can conclude that the tree is a correct proof term of the proposition.
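
The translation step in the last item can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the nested-tuple encoding of terms and the function name `to_haskell` are assumptions introduced here. It only builds the Haskell expression text; passing that text to GHCi for type checking is left out.

```python
def to_haskell(term):
    """Render a proof term, encoded as nested tuples (a hypothetical
    encoding), as the text of a Haskell expression."""
    tag = term[0]
    if tag == "var":        # ("var", name)
        return term[1]
    if tag == "lam":        # ("lam", parameter, body)
        return "(\\%s -> %s)" % (term[1], to_haskell(term[2]))
    if tag == "app":        # ("app", function, argument)
        return "(%s %s)" % (to_haskell(term[1]), to_haskell(term[2]))
    if tag == "pair":       # ("pair", first, second)
        return "(%s, %s)" % (to_haskell(term[1]), to_haskell(term[2]))
    if tag == "casepair":   # ("casepair", scrutinee, x, y, body)
        return "(case %s of (%s, %s) -> %s)" % (
            to_haskell(term[1]), term[2], term[3], to_haskell(term[4]))
    if tag in ("left", "right"):   # injections into a sum (Either)
        return "(%s %s)" % (tag.capitalize(), to_haskell(term[1]))
    if tag == "casesum":    # ("casesum", scrutinee, x, left_body, y, right_body)
        return "(case %s of { Left %s -> %s; Right %s -> %s })" % (
            to_haskell(term[1]), term[2], to_haskell(term[3]),
            term[4], to_haskell(term[5]))
    raise ValueError("unknown term constructor: %r" % (tag,))


# A proof term that swaps the components of a pair.
swap = ("lam", "p",
        ("casepair", ("var", "p"), "x", "y",
         ("pair", ("var", "y"), ("var", "x"))))
swap_source = to_haskell(swap)
```

Asking GHCi for the type of the resulting expression and comparing it with the proposition would complete the check; full parenthesization keeps the rendering unambiguous at the price of readability.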

We remark that our proof-synthesis procedure is not complete. Indeed, it does not terminate if a proposition that is not an intuitionistic tautology is passed. We do not claim that we propose the best proof-synthesis procedure, because a sound and complete proof-synthesis algorithm is known for the intuitionistic propositional logic [27]. The purpose of the present work is rather exploratory; we show the potential of DNNs, especially neural machine translation, for the problem of automated theorem proving.

The rest of the paper is organized as follows. Section 2 defines the target logic as a variant of the simply typed λ-calculus; Section 3 explains the sequence-to-sequence neural network which we use for proof synthesis; Section 4 presents the proof-synthesis procedure; Section 5 describes the experiments; Section 6 discusses related work; and Section 7 concludes.

2 Language

This section fixes the syntax for propositions and proof terms. Based on the Curry–Howard isomorphism, we use the notation of the simply typed λ-calculus extended with pairs and sums. We hereafter identify a type with a proposition and a λ-term with a proof.

Figure 2: Syntax.






Figure 3: Typing rules.

Figure 2 shows the syntax of the target language. We use metavariables for variables. The target language is an extension of the simply typed λ-calculus with products, sums, and holes. We use a hole in the synthesis procedure described later to represent a partially synthesized term. Since the syntax is standard (except holes), we omit an explanation of each construct. We also omit the dynamic semantics of the terms; it is not of interest in the present paper. Free and bound variables are defined in the standard way: a lambda abstraction binds its parameter in its body; a case expression for pairs binds its two component variables in its body; a case expression for sums binds one variable in each of its two branches. We identify two α-equivalent terms as usual. The size of a term is the number of nodes of its AST.

A typing context is a set of bindings, each of which associates a variable with a type. It can be seen as a partial function from variables to types. The typing judgment asserts that a term has a type under a typing context; the Curry–Howard isomorphism allows it to be seen as a judgment asserting that the term is a proof of the proposition under the assumptions in the context. Figure 3 shows the typing rules. Holes may have any type (T-Hole); the other rules are standard.
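
As an illustration of how such typing rules can be checked mechanically, here is a minimal sketch of a type checker. It is not the checker used in this work (which relies on GHCi); it is a bidirectional checker written for this example, the tuple encodings of types and terms are assumptions, and it handles only β-normal terms. A hole is accepted wherever a type is being checked, mirroring T-Hole, but a hole in the scrutinee of a case expression or in function position is rejected by this simplified version.

```python
def infer(ctx, term):
    """Infer the type of a variable or an application; None if impossible."""
    tag = term[0]
    if tag == "var":                       # ("var", name)
        return ctx.get(term[1])
    if tag == "app":                       # ("app", function, argument)
        fun_ty = infer(ctx, term[1])
        if fun_ty is not None and fun_ty[0] == "arrow" \
                and check(ctx, term[2], fun_ty[1]):
            return fun_ty[2]
    return None


def check(ctx, term, ty):
    """Check that `term` has type `ty` under the typing context `ctx`."""
    tag = term[0]
    if tag == "hole":                      # a hole may have any type
        return True
    if tag == "lam":                       # ("lam", x, body)
        return ty[0] == "arrow" and \
            check({**ctx, term[1]: ty[1]}, term[2], ty[2])
    if tag == "pair":                      # ("pair", first, second)
        return ty[0] == "prod" and \
            check(ctx, term[1], ty[1]) and check(ctx, term[2], ty[2])
    if tag == "left":                      # left injection into a sum
        return ty[0] == "sum" and check(ctx, term[1], ty[1])
    if tag == "right":                     # right injection into a sum
        return ty[0] == "sum" and check(ctx, term[1], ty[2])
    if tag == "casepair":                  # ("casepair", m, x, y, body)
        m_ty = infer(ctx, term[1])
        return m_ty is not None and m_ty[0] == "prod" and \
            check({**ctx, term[2]: m_ty[1], term[3]: m_ty[2]}, term[4], ty)
    if tag == "casesum":                   # ("casesum", m, x, n1, y, n2)
        m_ty = infer(ctx, term[1])
        return m_ty is not None and m_ty[0] == "sum" and \
            check({**ctx, term[2]: m_ty[1]}, term[3], ty) and \
            check({**ctx, term[4]: m_ty[2]}, term[5], ty)
    return infer(ctx, term) == ty          # variables and applications


A, B = ("atom", "a"), ("atom", "b")
swap_type = ("arrow", ("prod", A, B), ("prod", B, A))
swap = ("lam", "p", ("casepair", ("var", "p"), "x", "y",
                     ("pair", ("var", "y"), ("var", "x"))))
```

With these definitions, `check({}, swap, swap_type)` accepts the swapping term, while the same term with the pair built from `x` twice is rejected.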

3 Sequence-to-sequence neural network

We use the sequence-to-sequence (seq2seq) network as a neural network to translate a type to its inhabitant. This section reviews seq2seq briefly; interested readers are referred to Sutskever et al. [28] for details. We also assume basic familiarity with how a neural network conducts an inference and how a neural network is trained.

In general, application of a (deep) neural network to a supervised learning problem consists of two phases: training and inference. The goal of the training phase is to generate a model that approximates the probability distribution of a given dataset called training dataset. In a supervised learning problem, a training dataset consists of pairs of an input to and an expected output from the trained DNN model. For example, training for an image recognition task approximates likelihood of classes of images by taking images as inputs and their classes as expected outputs.

Training a seq2seq model approximates the conditional probability distribution $p(y_1, \ldots, y_m \mid x_1, \ldots, x_n)$, where $x_1, \ldots, x_n$ is an input sequence and $y_1, \ldots, y_m$ is an output sequence. After training, the trained model can be used to predict sequences from inputs in the inference phase.

Figure 4: seq2seq taking an input sequence and producing an output sequence.
Figure 5: Unfolding an LSTM unit for an input sequence.

An overview of the inference with a seq2seq model is shown in Figure 4. For each element of the input sequence, seq2seq performs the following procedure.

  1. The element is converted to a one-hot vector, which is a matrix with one cell for each vocabulary token used in the dataset, where all cells are 0 except that the cell for the element is 1.

  2. The one-hot vector is converted to a matrix by the word embedding [4, 19] (Layer 1), which compresses sparse, high-dimensional vector representations of words to dense, low-dimensional matrices.

  3. The output of word embedding is processed by 2 Long Short-Term Memory (LSTM) units [14] (Layers 2–3). LSTMs form a directed cyclic graph and are unfolded along the length of an input sequence, as in Figure 5. They take an input and the previous state, which is a matrix that has the information of the past inputs, and apply matrix operations to produce the output and the state for the next input. LSTMs can conduct a stateful inference; future outputs can depend on past inputs. This property is important for learning with time-series data such as sentences. In our system, the initial state is the zero vector.

  4. Finally, the output from the second LSTM is converted at the fully connected layer (Layer 4) to a vector with one element per vocabulary token, and the vector is translated to the token that is most likely.

In Figure 4, the snapshot of a model at an instant is aligned vertically. These snapshots are aligned horizontally from left to right along the flow of time. An input to a seq2seq model is a sequence of data , each of which is encoded as a one-hot vector. The input is terminated with a special symbol <EOS>, which means the end of the sequence. The response from the model for the symbol <EOS> is set to the first element of the output sequence. An output element is fed to the model to obtain the next output element until the model produces <EOS>.

The LSTM layers work as encoders while the model is fed with an input sequence . They work as decoders after the model receives <EOS> and while the model produces an output sequence .
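
The encode-then-decode control flow described above can be sketched as follows. This is only an illustration of the loop around the network. The `step` argument stands in for one pass through the four layers and is an assumption of this sketch, as are the name `greedy_decode` and the `max_len` safety bound. The toy `step` used here merely replays its input in reverse so that the example is runnable without a trained model.

```python
EOS = "<EOS>"


def greedy_decode(step, input_tokens, max_len=50):
    """Feed the input followed by <EOS>, then feed each output back in
    until the model itself emits <EOS> (or max_len tokens are produced)."""
    state, output = None, None
    # Encoding phase: outputs produced before <EOS> are ignored.
    for token in list(input_tokens) + [EOS]:
        state, output = step(state, token)
    # The response to <EOS> is the first element of the output sequence.
    result = []
    while output != EOS and len(result) < max_len:
        result.append(output)
        state, output = step(state, output)
    return result


def toy_step(state, token):
    """Stand-in for the network: remembers the input tokens while encoding,
    then emits them in reverse order followed by <EOS>."""
    memory, decoding = state if state is not None else ([], False)
    if not decoding and token != EOS:
        return (memory + [token], False), None
    if memory:
        return (memory[:-1], True), memory[-1]
    return ([], True), EOS


decoded = greedy_decode(toy_step, ["a", "->", "b"])
```

In a real model, `step` would perform the one-hot conversion, the embedding, the two LSTM updates, and the selection of the most likely token; only the surrounding loop is shown here.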

Since inputs to and outputs from a seq2seq model are sequences, in order to apply seq2seq to the proof-synthesis problem, we need a sequence representation of a type. As the sequence representation of a type , we use the token sequence provided by a Haskell interpreter GHCi; this representation is written . For example, is ( “”, “”, “(”, “”, “”, “”, “)”, “”, “”). This choice of the token-sequence representation is for convenience of implementation; since we use GHCi as a type checker, using token sequences in GHCi is convenient. We train seq2seq so that, when it is fed with the GHCi token-sequence representation of a type, it outputs the token-sequence representation of a GHCi term. We also write for the GHCi token-sequence representation of term .

Transforming outputs from seq2seq to terms is a tricky part because an output of a seq2seq model is not always parsable as a term, as we see in Section 5. Our synthesis procedure in Section 4 corrects parsing errors and finds the nearest parsable token sequence by Myers’ algorithm [21].

We hereafter use a metavariable for a trained seq2seq model; we write for the sequence that infers from the input sequence .

4 Program Synthesis Procedure

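1: procedure Synthesis(type)
2:     sequence ← output of the trained seq2seq model for the token sequence of type    ▷ Feed seq2seq with the given type
3:     guide ← NearestTerm(sequence)    ▷ Parse and obtain a guide term
4:     queue ← the empty heap of closed terms of type (ordered by the cost function)
5:     push the hole term onto queue    ▷ Proof search starts with the hole term
6:     loop    ▷ Search guided by guide
7:         repeat
8:             term ← pop the minimum-cost term from queue
9:         until term has not been investigated yet
10:         if term contains no holes then
11:             return term
12:         else
13:             for each candidate generated by filling a hole in term do
14:                 push candidate onto queue
15:             end for
16:         end if
17:     end loop
18: end procedure
Procedure 1 Synthesis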

Procedure 1 is the pseudocode of the procedure Synthesis, which takes a type and generates a closed, hole-free term of that type. The procedure uses a trained seq2seq model; it is trained in advance to generate an inhabitant of a type.

Line 2 feeds the trained model with the token-sequence representation of the given type and obtains a token sequence that is expected to be close to an inhabitant of the type. This generated sequence may be incorrect in the following two senses: (1) it may not be parsable (i.e., there may be no term whose token-sequence representation is the generated sequence) and (2) even if such a term exists, it may not be an inhabitant of the type. Synthesis fills these two gaps by postprocessing the output from seq2seq with the following two computations:

Guide synthesis

Line 3 calls procedure NearestTerm, which computes a term whose token-sequence representation has the smallest edit distance from the generated sequence. NearestTerm uses Myers’ algorithm [21]. The output term from NearestTerm is called a guide term.
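
To illustrate the notion of "nearest" used here, the sketch below computes the edit distance between token sequences and picks the closest sequence from a finite list of parsable candidates. This is a simplification and not Myers’ algorithm, which searches over all sequences derivable from the grammar rather than over a fixed list; the function names are chosen for this example only.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (insert, delete, replace)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # replace x by y, or keep
        prev = cur
    return prev[-1]


def nearest(sequence, candidates):
    """Return the candidate with the smallest edit distance to `sequence`."""
    return min(candidates, key=lambda c: edit_distance(sequence, c))


# An unparsable output (missing closing parenthesis) and two parsable candidates.
broken = ["(", "x", ",", "y"]
parsable = [["x"], ["(", "x", ",", "y", ")"]]
repaired = nearest(broken, parsable)
```

Here `repaired` is the pair expression, which differs from the broken sequence by a single inserted token.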

Guided proof search

Lines 4–17 enumerate candidate terms and test whether each candidate is a proof term of the given type. In order to give higher priority to a candidate term that is “closer” to the guide term, the procedure uses a priority queue. This queue orders the enqueued terms by the value of a cost function. The cost function is a parameter of the procedure; it is defined so that its value for a candidate is smaller if the tree edit distance [33] between the candidate and the guide term is smaller. We present the definition of the cost functions that we use later. The enumeration of the candidate terms continues until the procedure encounters a correct proof of the type. Although it is not guaranteed that this procedure terminates (Footnote 2: If the type does not have an inhabitant, then the procedure indeed diverges.), experiments presented in Section 5.5 indicate that it discovers a proof fast in many cases compared to a brute-force proof search.

Figure 6: Shallow contexts.

The remaining ingredient of the guided proof-search phase (Lines 4–17) is the subprocedure that generates candidate terms. It takes two parameters: a term, which may contain several holes, and the type of candidate terms to be synthesized. It generates the set of the terms that are obtained by filling a hole in the given term with a shallow context, a depth-1 term with holes, which is defined in Figure 6. Concretely, it constructs a set of candidate terms by the following steps: (1) constructing the set of all terms obtained by filling a hole in the given term with an arbitrary shallow context in which only the variables bound at the hole can occur freely (Footnote 3: Since we identify α-equivalent terms, the number of the shallow contexts that can be filled in is finite.); and (2) filtering out from this set the terms that contain a β-redex (to prune the proof-search space) or do not have the given type.
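
The overall shape of the guided search (a priority queue ordered by cost, a check for already investigated candidates, and expansion of one hole at a time) can be sketched as follows. To stay short and runnable, the sketch uses a toy domain in place of λ-terms: candidates are strings over `a` and `b` in which `?` marks a hole, the "type check" is replaced by the arbitrary rule that `bb` must not occur, and the cost counts mismatches with a guide string while ignoring holes. All names and the toy domain are assumptions of this example.

```python
import heapq
from itertools import count


def guided_search(start, expand, has_holes, cost):
    """Best-first search: repeatedly pop the cheapest candidate that has not
    been investigated yet; return it if it is hole-free, else expand it."""
    tie = count()                      # tie-breaker keeps heap entries comparable
    heap = [(cost(start), next(tie), start)]
    seen = set()
    while heap:
        _, _, term = heapq.heappop(heap)
        if term in seen:
            continue
        seen.add(term)
        if not has_holes(term):
            return term
        for candidate in expand(term):
            heapq.heappush(heap, (cost(candidate), next(tie), candidate))
    return None                        # only reachable when the space is finite


GUIDE = "abba"                         # an incorrect guide: it contains "bb"


def expand(term):
    """Fill the leftmost hole with each letter; drop invalid candidates."""
    i = term.index("?")
    filled = [term[:i] + letter + term[i + 1:] for letter in "ab"]
    return [t for t in filled if "bb" not in t]


def cost(term):
    """Number of filled positions that disagree with the guide."""
    return sum(1 for t, g in zip(term, GUIDE) if t != "?" and t != g)


found = guided_search("????", expand, lambda t: "?" in t, cost)
```

Because the cost never decreases when a hole is filled, the first hole-free candidate popped from the queue is one of the valid strings closest to the guide. Unlike Procedure 1, this toy search can exhaust its finite search space, in which case it returns `None`.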

5 Preliminary Experiments

In order to confirm the baseline of the proof synthesis with seq2seq and the feasibility of our proof-synthesis method, we train seq2seq models, implement Synthesis, and measure the performance of the implemented procedure.

5.1 Environment

We implemented Synthesis in Python 3 (version 3.6.1) except for NearestTerm, which is implemented in OCaml 4.04.1, and the type checker, for which we use the Haskell interpreter GHCi 8.0.1, relying on the type that GHCi infers for a term. Training of seq2seq and inference using the trained seq2seq models are implemented with the deep learning framework Chainer (version 1.24.0) [31]. (Footnote 4: We used publicly available code with an adaptation.) As the word2vec module, we used the one provided by the Python library gensim (version 0.13.4). We conducted all experiments on an Amazon EC2 g2.2xlarge instance, equipped with 8 virtual CPUs (Intel Xeon E5-2670; 2.60 GHz, 4 cores) and 15 GiB memory. Although the instance is equipped with a GPU, it is used in neither the training phase nor the inference phase.

As shown in Figure 4, our seq2seq network consists of 4 layers: a layer for word embedding, 2 LSTMs, and a fully connected layer.

 Layer type The number of parameters
 Word embedding
 LSTM 20 K
 LSTM 20 K
 Fully connected layer
Table 1: Learnable parameters in the network: is the number of the vocabularies.

Their learnable parameters are shown in Table 1, where is the number of vocabularies used in the dataset. The value of depends on the token-sequence representation of training data; in the current dataset, it is . The parameters in the word embedding are initialized by word2vec [19]; we used CBOW with negative sampling; the window size is set to 5. The weights of the LSTMs and the last fully connected layer are initialized by i.i.d. Gaussian sampling with mean 0 and deviation

(the number 50 is the output size of the previous layer of each); the biases are initialized to 0. We trained the model with stochastic gradient descent. The loss function is the sum of the softmax cross entropy accumulated over the token sequence. As an optimizer, we use Adam 

[16] with the following parameters: , , , and .

5.2 Generating dataset

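procedure TrainingDataset(number of pairs)    ▷ Make pairs of a type and its term
    dataset ← the empty set
    while dataset contains fewer pairs than requested do
        size ← choose from 1 to 9 uniformly at random
        term ← generate a closed, hole-free, well-typed term of that size at random
        type ← the type of term
        if dataset contains a pair of type and some other term then
            if term is smaller than the other term then
                replace the other term with term in dataset
            end if
        else
            add the pair of type and term to dataset
        end if
    end while
    return dataset
end procedure
Procedure 2 Generation of training dataset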

In order to train the network, we need a set of type–term pairs as training data. Since we are not aware of any such publicly available data, we used data generated by Procedure 2 for this purpose. This procedure first uniformly chooses an integer from 1 to 9, uniformly samples a term from the set of terms of that size, and adds it to the dataset. (Footnote 5: We also conducted an experiment with a dataset that only consists of β-normal terms; see Section 5.3.) If the dataset already contains a term of the same type as the sampled term, the smaller of the two terms is kept; otherwise, the sampled term is added to the dataset. Models are trained on a training set that consists of 1000 pairs of a type and a closed hole-free term of that type.
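
The bookkeeping of this generation procedure can be sketched as follows. The sampler of well-typed terms and the type inference are the hard parts and are not shown; they are passed in as the hypothetical parameters `random_term`, `type_of`, and `size_of`. The toy stand-ins at the bottom (strings as terms, length modulo 4 as the "type") exist only to make the example runnable.

```python
import random


def training_dataset(n, random_term, type_of, size_of, rng, max_size=9):
    """Collect n type-term pairs, keeping the smaller term when two sampled
    terms have the same type."""
    data = {}                                   # type -> smallest term so far
    while len(data) < n:
        size = rng.randint(1, max_size)         # uniform over 1..max_size
        term = random_term(size, rng)
        ty = type_of(term)
        if ty not in data or size_of(term) < size_of(data[ty]):
            data[ty] = term
    return data


# Toy stand-ins, not real terms or types.
toy_term = lambda size, rng: "x" * size
toy_type = lambda term: len(term) % 4

dataset = training_dataset(4, toy_term, toy_type, len, random.Random(0))
```

Storing the pairs in a dictionary keyed by type makes the "already contains a term of this type" test a constant-time lookup.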

We do not claim that a dataset generated by TrainingDataset is the best for training. Ideally, a training dataset should reflect the true distribution of the type–term pairs. The current dataset does not necessarily reflect this distribution in that (1) it ignores proofs of size more than 9 and (2) it gives higher preference to smaller proofs. (Footnote 6: We observed that the number of well-typed terms grows exponentially with respect to the size of a term; therefore, if we uniformly sample training data from the set of well-typed terms, a term with smaller size is rarely included in the dataset. By first fixing the size and then uniformly sampling a term of that size, we give relatively higher preference to smaller-size terms.) By using a repository of hand-written proofs as the training dataset, we could approximate this distribution, which is an important future direction of our work.

5.3 Training

We train the network using two datasets: one generated by Procedure 2 and the other generated in the same way but containing only β-normal forms. We trained the network on the second dataset as well because a proof term with a β-redex can be seen as a detour from the viewpoint of proof theory [27]; therefore, we expect that the model trained on the β-normal dataset generates a guide term that is more succinct than one trained on the unrestricted dataset. We used the following batch sizes for each training dataset: 16, 32, and 64. Each model is trained for 3000 epochs.

Figure 7: Smoothed plots of the training loss over epochs: the left graph shows the plots for the models trained with ; the right graph shows the plots for the models trained with ; each graph contains the plots for different batch sizes (BS).

Figure 7 shows the smoothed plots of the training loss over epochs. Since a loss represents the difference between expected outputs and actual outputs from the trained model, these graphs mean that the training of each model almost converges after 3000 epochs.

Model Inferred term from
Dataset Batch size
Table 2: Examples of terms inferred by trained seq2seq models.

Table 2 shows examples of terms inferred by trained models from the type that denotes swapping the components of pairs. All terms shown in Table 2 capture that they should take a pair, decompose it by a case expression, and compose a new pair using the decomposition result. Unfortunately, they are not correct proof terms of the designated type: terms in the first, second, fifth, and sixth rows refer to incorrect variables; and ones in the third and fourth rows decompose the argument pair only for making the first element of the new pair. On the other hand, they somewhat resemble a correct proof term. Our synthesis procedure uses a generated (incorrect) proof term to efficiently search for a correct one.

5.4 Evaluation of the trained models

We quantitatively evaluate our models on the following aspects:


  • How many inferred strings are parsable as proof terms?

  • How many inferred terms, after the postprocessing by NearestTerm, are indeed correct proofs?

  • How close is an inferred proof postprocessed by NearestTerm to a correct proof on average?

We measure the closeness by tree edit distance [33]; we measure the edit distance between an inferred term and a correct proof term that is the closest to the inferred term and whose size is not more than .

We generated terms using the trained models with a test dataset that consists of 1000 types sampled in a way similar to Procedure 2 but containing no type that appears in either training dataset. The evaluation results of the quantities above are shown in Table 3. We discuss the results below.

 Model  Dataset
 Batch size 16 32 64 16 32 64
 Evaluation  # of parsable 983 987 987 991 988 990
 # of typable 430 510 463 515 475 451
 Rate of misuse of vars (%) 39.82 30.61 42.27 28.45 33.90 30.78
 Closeness per AST node 0.1982 0.1805 0.1831 0.1878 0.1822 0.2001
Table 3: Evaluation of the trained models.
  • Every model generates a parsable response in more than 980 propositions out of 1000. This rate turns out to be high enough for the call to NearestTerm in the synthesis procedure to work in reasonable time.

  • The number of typable responses is between 430 and 515 out of 1000, depending on the training data and the batch size. We observed that the errors are often caused by the misuse of variables. For example, as shown in Table 2, a term that refers to a wrong variable is inferred as a proof term for the pair-swapping proposition. Although this term is incorrect, it is made correct by replacing its first variable reference with a reference to another variable. The row “Rate of misuse of vars” in Table 3 is the rate of such errors among all erroneous terms. Given that such errors are frequent, we guess that the combination of our method and premise selection heuristics is promising in improving the accuracy.

  • Closeness is measured by the average of the per-node minimum tree edit distance between a generated term (postprocessed by NearestTerm) and a correct proof term whose sizes are not more than . The precise definition is

    where is a type for the -th test case; is a set of closed hole-free terms of type whose sizes are not more than ; and . We can observe that we need to edit about 19% of the whole nodes of a term generated by the models in average to obtain a correct proof. We believe that this rate can be made less if we tune the network.
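
The closeness measure described in the last item can be sketched as follows. This is one plausible reading of the definition, not a transcription of it: the sketch assumes that the minimum tree edit distance for each test case is divided by the size of the generated term and that these ratios are then averaged over the test cases. The tree edit distance and the term size are taken as parameters; the toy string-based stand-ins below are assumptions that only make the example runnable.

```python
def closeness(generated, correct_sets, distance, size):
    """Average, over test cases, of the minimum distance from the generated
    term to any correct proof, divided by the generated term's size."""
    total = 0.0
    for term, candidates in zip(generated, correct_sets):
        nearest = min(distance(term, c) for c in candidates)
        total += nearest / size(term)
    return total / len(generated)


# Toy stand-ins: strings instead of ASTs, positional mismatches instead of
# a real tree edit distance.
toy_distance = lambda a, b: sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

score = closeness(["abcd", "ab"],
                  [["abcf", "xxxx"], ["ab"]],
                  toy_distance, len)
```

In the toy run, the first generated string is one edit away from its nearest correct string (a ratio of 0.25) and the second is already correct (a ratio of 0), so the average is 0.125.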

5.5 Evaluation of the synthesis procedures

We evaluate Procedure 1 with several instantiations of the cost function. In the definition of the cost functions, we use auxiliary function , which is defined as follows:

The function matches the term that contains holes against the guide term ; if is a hole, then is returned as the result. If the top constructor of is different from , then it returns . This function is used to express different treatment of a hole in computation of tree edit distance.

We use the following cost functions computed from a candidate term and a guide term :

  • that does not take the edit distance between and . Since this function ignores the guide term, Procedure 1 instantiated with this cost function searches for a correct proof term in a brute-force way.

  • , where is the size of and is the tree edit distance between and .

  • . Although this function also takes the edit distance between and into account in the cost computation, does not count the distance between a hole in and the corresponding subtree in , while in counts the distance between a hole in and the corresponding subtree in .

We call for the Procedure 1 instantiated with , for one instantiated with , and for one instantiated with . Since treats the difference between a hole in a candidate and the corresponding subtree of guide term as cost , is expected to be more aggressive in using the information of the guide term. By comparing against and , we can discuss the merit of using neural machine translation with respect to the brute-force search.

Procedure Sum ED-0 ED-1 ED-2 ED-3 ED-4 ED-5 ED-6 ED-7 ED-8 ED-10
1754.70 1.41 1.37 1.47 3.44 3.96 21.46 264.82 37.10 272.25 N/A
7546.57 1.46 1.80 1.91 5.53 5.21 29.99 1681.14 30.46 296.36 N/A
1336.19 1.49 1.51 11.52 1.98 8.10 35.17 59.14 N/A 243.59 N/A
2142.12 1.58 1.96 37.18 2.82 11.32 47.77 110.50 N/A 412.64 N/A
1173.34 1.38 1.69 1.95 3.41 8.86 30.35 40.37 N/A 425.25 N/A
3420.96 1.29 1.54 1.96 3.31 5.16 32.03 45.99 N/A 2688.28 N/A
1587.47 1.44 1.61 2.10 3.19 21.52 3.57 81.98 247.83 N/A 1.96
2461.17 1.51 1.87 2.10 5.16 33.90 11.57 47.72 279.12 N/A 835.55
1308.47 1.49 1.86 3.08 4.15 1.74 13.95 102.53 3.42 N/A N/A
3316.54 1.41 1.90 3.57 4.13 1.94 17.52 299.16 10.43 N/A N/A
567.61 1.31 1.50 1.85 1.98 4.73 8.30 82.06 N/A 37.18 N/A
640.44 1.20 1.55 2.04 2.16 6.97 10.02 90.30 N/A 38.67 N/A
Table 4: Running time of the synthesis procedure (in seconds). The column “Procedure” presents the procedure name with the used training set and the batch size. The column “Sum” presents the sum of running time for 100 test cases. The column ED-n presents the average running time for the test cases in which the edit distance between a guide term and the found proof term is n. If a cell in the column ED-n is marked N/A, it means that there was no test case in which the edit distance between a guide term and the found proof term was n. The brute-force procedure does not have data in the ED-n columns since it ignores the guide term.

We generated a test dataset that consists of 100 types in the same way as in Section 5.3 for evaluation. We measured the running time of each procedure with the models trained on different training datasets and with different batch sizes. The result is shown in Table 4. One of the runs crashed due to a run-time memory error in the 42nd test case; the value of the Sum column for that run in Table 4 reports the sum of the running time until the 41st test case.


The two DNN-guided procedures are much faster than the brute-force procedure. This indicates that guide terms inferred by the trained models are indeed useful as a hint for proof search.

Comparing the synthesis procedures with models trained on different datasets, we can observe that the models trained on the β-normal dataset often make the synthesis faster than the models trained on the unrestricted dataset. This accords with our expectation. Although the result also seems to be largely affected by the batch size used in the training phase, investigation of the relation between the batch size and the performance of the synthesis procedure is left as future work.

is in general slower than especially in the cases where the edit distance is large. We guess that this is due to the following reason. first explores a term that is almost the same as the inferred guide term since calculates edit distances by assuming that holes of proof candidates will be filled with subterms of the guide term. This strategy is risky because it wastes computation time if the distance between the guide term and a correct proof is large. The result in Table 3 suggests that the current models tend to infer a term such that it contains more errors if it is larger. This inaccuracy leads to the worse performance of the current implementation of . We think that becomes more efficient by improving the models.

The current implementation of explicitly computes in the computation of the cost function. This may also affect the performance of . This could be improved by optimizing the implementation of .

To conclude the discussion, the guide by neural machine translation is indeed effective in making the proof search more efficient. We expect that we can improve the performance by improving the accuracy of the neural machine translation module.

6 Related Work

Loos et al. [17] use a DNN to improve the clause-selection engine of the Mizar system [3]. In their work, the input to the network is a set of unprocessed clauses and the negation of a goal to be refuted; it is trained to recommend which unprocessed clause should be processed. They report that their architecture improves the performance of the proof-search algorithm. Our work shares the same goal as theirs (i.e., DNN-assisted theorem proving) but takes a different approach: they use a DNN for improving heuristics of an automated theorem prover, whereas we use a DNN for translating a proposition to its proof. They observe that the combination of a DNN and the conventional proof search is effective in expediting the overall process, which parallels the design of our proof synthesis, which uses the proof suggested by a DNN as a guide for proof search.

As we mentioned in Section 1, a proof-synthesis procedure can be seen as a program-synthesis procedure via the Curry–Howard isomorphism. In this regard, DNN-based program-synthesis methods [10, 2] are related to our work. Devlin et al. [10] propose an example-driven program-synthesis method for string-manipulation problems. They compare two approaches to DNN-based program learning: neural synthesis, which learns a program written in a DSL from input/output examples, and neural induction, which does not explicitly synthesize a program but uses a learned model as a map for unknown inputs. Balog et al. [2] propose a program-synthesis method for a functional DSL to manipulate integer lists. Their implementation synthesizes a program in two steps as we do: a DNN generates a program from a set of input–output pairs; then, the suggested program is modified to obtain a correct program. Both Devlin et al. and Balog et al. study inductive program synthesis that generates a program from given input–output examples, while our work corresponds to program synthesis from given specifications. The state-of-the-art program synthesis with type specifications [23] generates programs from liquid types [25], which allow for representing a far richer specification than STLC. We are currently investigating whether our proof-as-translation view can be extended to a richer type system (or, equivalently, a richer logic).

7 Conclusion

We proposed a proof-synthesis procedure for the intuitionistic propositional logic based on neural machine translation. Our procedure generates a proof term from a proposition using the sequence-to-sequence neural network. The network is trained in advance to translate the token-sequence representation of a proposition to that of its proof term. Although we did not carefully tune the network, it generates correct proofs for almost half of the benchmark problems. We observed that an incorrect proof is often close to a correct one in terms of the tree edit distance. Based on this observation, our procedure searches for a correct proof using the generated proof as a guide. We empirically showed that our procedure generates a proof more efficiently than a brute-force proof-search procedure. As far as we know, this is the first work that applies neural machine translation to automated theorem proving.

As we mentioned in Section 1, one of the main purposes of the present paper is to measure the baseline of DNN-based automated proof synthesis. The result of our experiments in Section 5 suggests that the current deep neural network applied to automated proof synthesis can be trained so that it generates a good guess to many problems, which is useful to make a proof-search process efficient.

We believe that this result opens up several research directions that are worth investigating. One of these directions is to extend the target language. Although we set our target to a small language (i.e., the intuitionistic propositional logic) in the present paper, automated proof synthesis for more expressive logics such as the Calculus of Constructions [7] and separation logic [24, 15] is useful. In an expressive logic, we guess that we need more training data to avoid overfitting. To obtain such a large amount of data, we consider using an open-source repository of the proofs written in the language of proof assistants such as Coq [18] and Agda [22].

Another future direction is to improve the performance of the neural machine translation. In general, the performance of a deep neural network is known to be largely affected by how well the network is tuned. Besides the network itself, we guess that the performance may be improved by tuning the representation of propositions and proofs. For example, we used the infix notation to represent a proposition (e.g., for an implication), although a proof term for an implication is an abstraction . If we represent an implication in the postfix notation (i.e., ), then the symbol in the proposition and the symbol in the proof term comes closer in a training data, which may lead to a better performance of sequence-to-sequence networks as is suggested by Sutskever et al. [28].

The current proof-search phase uses several variants of cost functions to prioritize the candidates to be explored. By tuning this function, we expect that we can make the synthesis procedure faster. We are currently looking at the application of reinforcement learning [29] to automatically search for a cost function that leads to a good performance of the overall synthesis procedure.


We would like to thank Takayuki Muranushi for making a prototype implementation of the early proof synthesizer without DNNs. We also thank Akihiro Yamamoto; the discussion with him led to the evaluation metrics used in this paper. This paper is partially supported by JST PRESTO Grant Number JPMJPR15E5, Japan.