
Natural Language Premise Selection: Finding Supporting Statements for Mathematical Text

Mathematical text is written using a combination of words and mathematical expressions. This combination, along with a specific way of structuring sentences makes it challenging for state-of-art NLP tools to understand and reason on top of mathematical discourse. In this work, we propose a new NLP task, the natural premise selection, which is used to retrieve supporting definitions and supporting propositions that are useful for generating an informal mathematical proof for a particular statement. We also make available a dataset, NL-PS, which can be used to evaluate different approaches for the natural premise selection task. Using different baselines, we demonstrate the underlying interpretation challenges associated with the task.



1 Introduction

Comprehending mathematical text requires evaluating the semantics of its mathematical structures (such as expressions) and connecting its internal components with the respective definitions or premises [12].

State-of-the-art models for natural language processing, such as BERT [7], achieve high scores on several tasks, such as entity recognition, textual entailment and machine translation, but they do not encode the intricate mathematical background knowledge needed to reason over mathematical discourse.

The language of mathematics is composed of a combination of words and symbols, where symbols follow a different set of rules and have a specific alphabet. Nonetheless, words and symbols are interdependent in the context of mathematical discourse. This phenomenon is exclusive to mathematical language, not found in any other natural or artificial language [9], providing a unique and challenging application for semantic evaluation and natural language processing.

Understanding mathematical discourse has been explored before as a Mathematical Knowledge Extraction task [1]; however, several aspects related to deeper and more granular reasoning over mathematical discourse have not yet been investigated. The literature also lacks the datasets needed for exploring and studying mathematical discourse and its associated interpretation and reasoning.

We propose the task of natural premise selection, inspired by the field of automated theorem proving. Premise selection appeared initially as a task of selecting a (useful) part of an extensive formal library in order to limit the search space for an Automated Theorem Proving (ATP) system, increasing the chance of finding a proof for a given conjecture [5]. Premises considered relevant are the ones that ATPs use for the automatic deduction process of finding a proof for a conjecture. The premise selection task is defined as: given a collection of premises $P$, an ATP system with given resource limits, and a new conjecture $c$, predict those premises from $P$ that will most likely lead to an automatically constructed proof of $c$ by the ATP system [13].

Natural premise selection is based not on formally structured mathematics but on human-generated mathematical text. It takes as input mathematical text, written in natural language, and outputs relevant mathematical statements that could support a human in finding a proof for that mathematical text. The premises are composed of a set of supporting definitions and supporting propositions that act as explanations for the proof process.

For example, the famous Fermat’s Little Theorem [19] has several possible proofs, one of them using Euclid’s Lemma. In this example, Euclid’s Lemma would be considered useful for a human trying to prove Fermat’s Little Theorem; therefore, it is a premise for the conjecture that Fermat’s Little Theorem presents.

In order to evaluate this task, we propose a new dataset, NL-PS (Natural Language - Premise Selection), built from the human-curated website ProofWiki. This dataset opens possibilities of applications not only for the premise selection task but also for evaluating semantic representations for mathematical discourse (including embeddings), textual entailment for mathematics and natural language inference in the context of mathematical texts.

The contributions of this paper can be summarised as follows:

  • Proposal of a new NLP task: natural language premise selection.

  • A novel dataset, NL-PS, to support the evaluation of premise selection methods using natural language corpora.

  • Comparison of different baselines for the natural premise selection task.

2 Related Work

NLP has been applied before in the context of general mathematics. Chaganty and Liang (2016) propose a new task for semantic analysis, perspective generation, i.e., generating descriptions for numerical values using other values as reference. Huang et al. (2016) analyse different approaches to solving mathematical word problems and conclude that it is still an unsolved challenge.

Ganesalingam and Gowers (2017) propose a program that solves elementary mathematical problems, mainly in metric space theory, and presents solutions similar to those produced by humans. The authors recognise that their system operates at a disadvantage because human language involves several constraints that rule out many sound and effective tactics for generating proofs.

Wang et al. (2018) propose an approach to automatically formalise informal mathematics using statistical parsing methods and large-theory automated reasoning. The idea is to convert an informal statement into a formal one, using Mizar as the output language. After the statement has been correctly translated, it can be checked using an automatic tool.

Naproche (Natural language Proof Checking) [6] is a project focused on developing a controlled natural language (CNL) for mathematical texts and on adapting proof-checking software to work with this language in order to check syntactic and mathematical correctness.

Zinn (2003) proposes proof representation structures to model mathematical discourse using discourse representation theory, along with a prototype that could be used to automate the process of generating proofs.

Approaches for creating embeddings of mathematical text have applied variations of the Skip-gram model [16], extending it with a specific tokenisation strategy for equations and mathematical terms. Most tokenisation strategies use the tree structure of an equation to define the target tokens, ranging from considering the full equation as a single token [14] to decomposing its component expressions or operating at the individual symbol level [10]. Greiner-Petter et al. [12] developed a skip-gram-based model using as a reference corpus a set of arXiv papers in HTML format, with term-level tokenisation granularity. The authors found that the induced vector space did not produce meaningful semantic clusters.

Premise selection is an approach generally used for selecting useful premises to prove conjectures in Automated Theorem Proving (ATP) systems [3]. Irving et al. [13] propose a neural architecture for premise selection using formal statements written in Mizar. Other authors have used machine learning approaches such as kernel-based learning [2], the k-NN algorithm [11] and Random Forests [8]. However, the neural approaches [13] have obtained higher scores at the premise selection task.

3 Linguistic Considerations

In this section, we describe some of the linguistic features present in a mathematical corpus. Our aim is to examine mathematical discourse in combination with natural language. The following definitions are not of mathematical objects, since those already have established mathematical definitions; in this work, we are interested in how the different mathematical objects are presented inside mathematical text.

Definition 1.

A mathematical expression $e$, in a mathematical text, is defined by a set $\{s_1, \ldots, s_n\}$, where $s_i \in \Sigma$ and $\Sigma$ is the set of symbols present in a certain mathematical domain of discourse, such as variables, constants and functions. A variable, for example, is considered an expression.

Definition 2.

An equation is defined as a combination of expressions and an (in)equality predicate (e.g., $=$, $\leq$).

Definition 3.

A mathematical statement can be:

  • A sequence of words (from the mathematical domain) or;

  • A sequence of words and expressions and/or equations or;

  • A sequence of only equations.

Definition 4.

A mathematical text is a sequence of mathematical statements.

In a mathematical text, words, expressions and equations can be directly related through a relationship of definiendum and definiens, where an expression, the definiendum, is defined by a mathematical statement or part of a mathematical statement, the definiens. The definiens is also used to determine the set of values and properties associated with an expression.

Definition 5.

A mathematical definiens is the set of tuples composed of a definiendum and the set of (parts of) mathematical statements that declare and/or quantify that definiendum in the mathematical text. Figure 1 presents an example where the definiens and the definiendum are highlighted. A definiendum can have more than one definiens; in Figure 1, for example, the function being defined is both declared by its defining equation and given the property “real function”. Therefore:

Let $a$ be a strictly positive real number.

Let $f$ be the real function defined as $f(x) = a^x$, where $a^x$ denotes $a$ to the power of $x$.

Then $f$ is convex.

Figure 1: Theorem with three definiendums and six definiens, where the content inside the boxes are the definiendums and the underlined content the definiens.

Different mathematical texts can also be related, since mathematical knowledge is often incremental, where one element depends on others. For example, in Figure 1, in order to understand the meaning of the presented text, we need to understand the definition of a real function, which is defined in another mathematical text.

Definition 6.

A mathematical supporting definition for a mathematical text $t$ is a set of mathematical texts $D$, where every element of $D$ contains a definition of a concept presented in $t$. For example, the theorem in Figure 1 is connected to the mathematical text that defines what a real function is.

Definition 7.

A definition is composed of a 4-tuple $(t, C, N, R)$, where $t$ is the definition text, $C$ is the set of categories that the definition belongs to, $N$ is the set of definiens in the text and $R$ is the set of definitions referenced in $t$. If $R$ is empty, we call it an atomic definition.

A mathematical proof is a particular mathematical text that tries to convince the reader that a specific hypothesis can lead to a conclusion [17]. Proofs often contain mathematical bindings. They can also be connected to other propositions, such as lemmas, theorems and corollaries, as we will define next.

Definition 8.

A mathematical supporting proposition is the set of propositions that helps support the argument proposed in the mathematical text of a proof. It is often used as an explanation for certain statements used in the construction of the proof. Figure 2 presents part of a proof, where the names of the supporting facts are highlighted. For example, the mathematical statement of Cauchy’s Mean Theorem is a supporting fact for the proof shown.

Figure 2: Example of part of a proof, where four mathematical supporting facts are cited: Power of Positive Real Number is Positive: Real Number, Exponent Combination Laws (twice) and Cauchy’s Mean Theorem.
Definition 9.

The set of premises $P$ of the mathematical text of a proof is the union of its set of supporting facts $F$ and its set of supporting definitions $D$, i.e., $P = F \cup D$.

Definition 10.

A mathematical proof is composed of a tuple $(t, P)$, where $t$ is the proof text and $P$ is the set of premises for $t$.

Definition 11.

A theorem is composed of a 4-tuple $(t, C, N, \Pi)$, where $t$ is the theorem’s text, $C$ is the set of categories that the theorem belongs to, $N$ is the set of definiens in the text and $\Pi$ is the set of proofs for the theorem (one theorem can have more than one possible proof).

Definition 12.

Similarly, we can define a lemma, composed of a 5-tuple $(t, C, N, \Pi, h)$: the theorem tuple with the addition of $h$, the theorem where the lemma occurs.

Definition 13.

A corollary is composed of a 5-tuple $(t, C, N, \Pi, h)$, where $h$ is the theorem that derives the corollary.

4 Dataset Construction: NL-PS

In this section, we present our dataset, NL-PS, and detail the steps taken to construct it. The dataset is available as a set of JSON files. A summary of the construction process is presented in Figure 3.

Figure 3: Pipeline used to build the NL-PS dataset.

Parsing the corpus

The proposed dataset was extracted from the source code of ProofWiki, an online compendium of mathematical proofs whose goal is to collect and classify mathematical proofs. ProofWiki contains links between theorems, definitions and axioms in the context of a mathematical proof, determining which dependencies are present. ProofWiki is manually curated by different collaborators; therefore, there are different styles of mathematical text, and many elements cannot be extracted automatically.

Cleaning wiki tags

ProofWiki uses wikimedia tags; however, it also has specific tags related to the mathematical domain, so default wiki extraction tools cannot be used. A bespoke tool was developed to comply with ProofWiki’s tagging scheme. For example, there is a particular tag for referring to another mathematical text, using passages from other texts in order to support a claim (Figure 4).
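To illustrate the kind of bespoke cleaning described above, the following minimal sketch handles standard MediaWiki-style internal links and template tags. The exact ProofWiki tag inventory and the example markup are assumptions for illustration only, not the paper's actual extraction tool.

```python
import re

# Hypothetical markup conventions: [[Target|display text]] internal links
# and {{...}} template tags, as in standard MediaWiki syntax.
LINK_RE = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")
TEMPLATE_RE = re.compile(r"\{\{[^{}]*\}\}")

def extract_links(text):
    """Return the targets of all internal links (candidate premise names)."""
    return [m.group(1).strip() for m in LINK_RE.finditer(text)]

def clean(text):
    """Replace links by their display text, drop templates, tidy whitespace."""
    text = LINK_RE.sub(lambda m: m.group(2) or m.group(1), text)
    text = TEMPLATE_RE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()

raw = "By [[Euclid's Lemma|Euclid's Lemma]], {{qed}} the result follows."
print(extract_links(raw))  # ["Euclid's Lemma"]
print(clean(raw))          # By Euclid's Lemma, the result follows.
```

The link targets double as the raw material for the supporting-fact extraction step described later in this section.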

Figure 4: An example where Mathematical text 1 references a passage in Mathematical text 2 using the name of the passage to be referenced between curly brackets. Only the part highlighted is being referenced.

Proof curation

Several pages in ProofWiki are not directly related to mathematical propositions or definitions, such as user pages, help pages and pages about specific talks. We manually analysed the pages and removed the ones that are not definitions, lemmas, theorems or corollaries.

Extraction of categories

ProofWiki associates categories with each page. However, the categories are not harmonised across definitions and propositions. We merged different categories that belonged to the same mathematical branch and selected the categories that contained at least 100 different entries. The categories selected are: Analysis, Set Theory, Number Theory, Abstract Algebra, Topology, Algebra, Relation Theory, Mapping Theory, Real Analysis, Geometry, Metric Spaces, Linear Algebra, Complex Analysis, Applied Mathematics, Order Theory, Numbers, Physics, Group Theory, Ring Theory, Euclidean Geometry, Class Theory, Discrete Mathematics, Plane Geometry and Units of Measurement.
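The merge-then-threshold step can be sketched as below. The merge map and the toy page data are invented for illustration; the real merge decisions were made manually.

```python
from collections import Counter

# Hypothetical merge map: raw category -> harmonised mathematical branch.
MERGE = {"Real Analysis I": "Real Analysis", "Sets": "Set Theory"}

def harmonise(page_categories, min_entries=100):
    """Count pages per harmonised category; keep those with enough entries."""
    counts = Counter()
    for cats in page_categories.values():
        for c in cats:
            counts[MERGE.get(c, c)] += 1
    return {c for c, n in counts.items() if n >= min_entries}

# Toy corpus: 120 pages in a mergeable analysis category, 30 in a small one.
pages = {f"page{i}": ["Real Analysis I"] for i in range(120)}
pages.update({f"q{i}": ["Sets"] for i in range(30)})
print(harmonise(pages))  # {'Real Analysis'}
```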

Extracting supporting facts

The pages in ProofWiki are connected using hyperlinks. We leverage this structure to extract supporting propositions and supporting definitions. From the mathematical text of a definition, we extract the hyperlinks connecting to other definitions; these links are the supporting definitions. From the mathematical text of proofs, we extract the hyperlinks to other propositions and consider these as supporting propositions. For example, Figure 5 presents a theorem and its respective proof. The proof contains links (highlighted) to other propositions; these are the supporting propositions needed to support the proof.
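The extraction rule above can be sketched with a toy corpus. The page names, field names and structure are assumptions for illustration; the real pipeline operates on ProofWiki source markup.

```python
# Toy corpus: each entry records its page type, the links occurring in its
# statement text, and (for proofs) the links occurring in the proof body.
corpus = {
    "Def:Real Function": {"type": "definition", "links": ["Def:Function"]},
    "Def:Function": {"type": "definition", "links": []},
    "Thm:Convexity": {"type": "theorem",
                      "links": ["Def:Real Function"],
                      "proof_links": ["Cauchy's Mean Theorem"]},
}

def supporting_definitions(page):
    """Links from the statement text that point at definition pages."""
    return [l for l in corpus[page].get("links", [])
            if l in corpus and corpus[l]["type"] == "definition"]

def supporting_propositions(page):
    """Links occurring inside the proof body."""
    return corpus[page].get("proof_links", [])

print(supporting_definitions("Thm:Convexity"))   # ['Def:Real Function']
print(supporting_propositions("Thm:Convexity"))  # ["Cauchy's Mean Theorem"]
```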

Figure 5: Example of supporting propositions for a theorem.

Annotating mathematical text

The entries in ProofWiki are often divided into sections; for NL-PS, we are only interested in the sections that present a definition, a proposition or a proof. Proofs were curated (combining manual and automatic annotation) to contain only mathematical discourse, removing satellite discourse such as historical notes. Because some propositions can be proved in different ways, we also annotated the different proofs that can be found inside a single page.

Structuring the entries

Finally, the dataset is structured as follows:

  • Definition entries are composed of a mathematical text and a set of supporting definitions.

  • Lemmas and theorems have a mathematical text, a proof and a set of premises.

  • Corollaries are composed of a mathematical text, a proof, a set of premises and the theorem that derives the corollary.

5 Dataset Analysis

NL-PS has a total of 20,401 different entries, composed of definitions, lemmas, corollaries and theorems, as shown in Table 1.

Type Number of entries
Definitions 5,633
Lemmas 327
Corollaries 292
Theorems 14,149
Total 20,401
Table 1: Types of mathematical documents in NL-PS

Figure 6 presents the distribution of different categories in the dataset.

Figure 6: Distribution of documents per category in the dataset.

Figure 7 presents a histogram of the frequency of different numbers of premises. We can observe that statements usually have a small number of premises, most containing between one and five. The highest number of premises for a single theorem occurs for the theorem “The Sorgenfrey line is Lindelöf.”

Figure 7: Distribution of the number of premises in the ProofWiki corpus.

Similarly, the histogram in Figure 8 shows the frequency of the different number of dependencies.

We also computed how many times each statement is used as a premise, and observed that most statements are used as dependencies by only a small subset of other statements, with between one and three dependants being typical. The number of tokens per statement depends on the type of tokenisation used for the mathematical symbols.

Figure 8: Number of times a statement is referred as a premise.

We can also represent the connections (premises) between different mathematical texts as a graph. This graph has a total of 14,393 nodes (smaller than the number of entries, since some entries are disconnected and are not considered in the graph) and 34,874 edges.
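The graph construction above amounts to treating each premise link as a directed edge and dropping entries that never appear in any edge. A minimal sketch on toy data (the page names are invented; the real graph has 14,393 nodes and 34,874 edges):

```python
# Each edge (source, target) reads: "source has target as a premise".
edges = [("Thm:A", "Def:X"), ("Thm:A", "Lem:B"), ("Lem:B", "Def:X")]
entries = {"Thm:A", "Lem:B", "Def:X", "Def:Isolated"}

# Nodes are entries touched by at least one edge; disconnected entries drop out.
nodes = {endpoint for edge in edges for endpoint in edge}
print(len(nodes), len(edges))        # 3 3
print(entries - nodes)               # {'Def:Isolated'}
```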

The dataset provides a specific semantic modelling challenge for natural language processing, as it requires specific tokenisation, co-reference resolution and the modelling of discourse structures tailored towards mathematical text. One crucial challenge is resolving the semantics of variables in mathematical expressions, which requires a particular binding method. As shown in Figure 9, variables that refer to the same set can have different names; in the definition of sine, for example, differently named variables refer to the same set. Essentially, variables serve as a mathematical alternative to anaphora [9].

Figure 9: Variables with different symbols, but referring to the same set.

6 Experiments

In order to identify the challenges of the task of natural premise selection using NL-PS, we performed initial experiments using two baselines: TF-IDF and PV-DBOW [15]. We use both techniques to create vector representations for all the mathematical texts, then compute the cosine similarity between each entry and rank the results by proximity. We then compute the Mean Average Precision (MAP) for each baseline, ranking all possible premises, computed as:

$\mathrm{MAP} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{AveP}(t_i)$

where $N$ is the total number of documents, $t_i$ is the $i$-th mathematical text and AveP is the average precision.
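The evaluation described above can be sketched as follows: for each text, rank all candidate premises, compute the precision at each relevant hit, and average over texts. The rankings and gold premises below are toy data.

```python
def average_precision(ranked, relevant):
    """AveP over one ranked candidate list against a set of gold premises."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank       # precision at this relevant hit
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(rankings, gold):
    """MAP = mean of AveP over all mathematical texts."""
    return sum(average_precision(rankings[t], gold[t]) for t in gold) / len(gold)

# Toy data: two statements, each with one gold premise.
rankings = {"thm1": ["p2", "p1", "p3"], "thm2": ["p3", "p2", "p1"]}
gold = {"thm1": {"p1"}, "thm2": {"p3"}}
print(mean_average_precision(rankings, gold))  # 0.75
```

Here thm1 ranks its gold premise second (AveP 0.5) and thm2 ranks its first (AveP 1.0), giving a MAP of 0.75.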

Table 2 presents the initial results. We compare three different types of tokenisation for the mathematical elements. Initially, we treat expressions and equations as single tokens, so that a whole expression is considered a single word. We also considered tokenised expressions, splitting operations and operands into individual tokens. Finally, we tokenise the whole text as a sequence of characters. We ran PV-DBOW with the default parameters, comparing different embedding sizes, with the best results obtained with an embedding size of 100.
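The three granularities can be sketched as below. The symbol-level regex is an illustrative assumption, since the paper does not specify its exact expression tokeniser, and a plain-text stand-in is used for a mathematical expression.

```python
import re

def tokenise(expr, mode):
    """Three tokenisation granularities for a mathematical expression."""
    if mode == "expression":
        return [expr]                            # whole expression as one word
    if mode == "symbol":
        return re.findall(r"[A-Za-z]+|\d+|\S", expr)  # operands and operators
    if mode == "char":
        return [c for c in expr if not c.isspace()]   # character level
    raise ValueError(mode)

expr = "ab + 12"
print(tokenise(expr, "expression"))  # ['ab + 12']
print(tokenise(expr, "symbol"))      # ['ab', '+', '12']
print(tokenise(expr, "char"))        # ['a', 'b', '+', '1', '2']
```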

From these initial results, we can conclude that the task is semantically non-trivial and cannot be solved with retrieval strategies based on lexical overlap. We also notice that we obtain better results when we tokenise the expressions, hinting that the elements inside the expressions have semantic properties that are relevant for determining the relevant premises. For the following experiments, we use tokenised expressions and PV-DBOW with an embedding size of 100.

Tokenisation          TF-IDF  PV-DBOW (50)  PV-DBOW (100)  PV-DBOW (200)
Expression as words   0.073   0.048         0.051          0.046
Tokenised expressions 0.089   0.069         0.073          0.072
Char level            0.051   0.059         0.065          0.061
Table 2: MAP results for TF-IDF and PV-DBOW comparing tokenisation of expressions. For PV-DBOW, we compare embedding dimensions of 50, 100 and 200.

In Table 3 we compare the results for different sizes of the dataset. We consider the full dataset and three different subsets with different categories. We can notice that for smaller datasets, both baselines perform better. This result was expected since with smaller datasets there are less possible premises, and elements from the same categories tend to be more uniform between themselves.

Category (entries)   TF-IDF  PV-DBOW
All Categories       0.089   0.076
Algebra (1,241)      0.183   0.177
Analysis (1,102)     0.191   0.212
Number Theory (741)  0.242   0.188
Table 3: Comparing results for different categories (the number in parentheses indicates the number of entries for that category).

We can also consider the fact that premises are transitive, i.e., if a mathematical text $t_1$ has $t_2$ as a premise and $t_2$ has $t_3$ as a premise, then $t_3$ should also be a premise of $t_1$. In this case, the task becomes even more challenging, as we show in Table 4, where we consider transitivity with two and three hops of distance. From the results, we notice that the more hops needed to obtain the premise, the worse our baselines perform.
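The transitive closure up to a hop limit can be sketched as below, on a toy premise graph (the page names are invented for illustration).

```python
# Toy premise graph: each key maps a statement to its direct premises.
premises = {"thm": {"lemA"}, "lemA": {"defX"}, "defX": {"defY"}}

def k_hop_premises(start, k):
    """Collect all premises reachable from `start` in at most k hops."""
    frontier, closure = {start}, set()
    for _ in range(k):
        frontier = {p for f in frontier for p in premises.get(f, set())}
        closure |= frontier
    return closure

print(k_hop_premises("thm", 1))  # {'lemA'}
print(k_hop_premises("thm", 2))  # {'lemA', 'defX'}
print(k_hop_premises("thm", 3))  # {'lemA', 'defX', 'defY'}
```

Each extra hop enlarges the gold premise set, which is one reason the scores in Table 4 drop as the hop distance grows.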

Premise distance  TF-IDF  PV-DBOW
1-hop premises    0.089   0.073
2-hop premises    0.052   0.047
3-hop premises    0.038   0.031
Table 4: Comparing the number of hops needed to obtain premises.

We also verify how state-of-the-art embedding models perform on such a specific dataset. BERT [7] is reported to perform well on different NLP tasks, including those probing numeracy [18].

In order to use BERT, we formulate the problem as pairwise relevance classification, where we aim to classify whether one mathematical text is connected to another. We do not perform any pre-processing of the expressions.
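Under this formulation, training examples are (statement, candidate) pairs labelled by whether the candidate is a premise. A minimal sketch of the pair construction, with invented data and a simple negative-sampling scheme as assumptions (the paper does not specify its sampling strategy; the fine-tuning itself is omitted):

```python
import random

def build_pairs(premise_map, all_statements, neg_per_pos=1, seed=0):
    """Build labelled (statement, candidate, label) pairs:
    label 1 for gold premises, 0 for sampled non-premises."""
    rng = random.Random(seed)
    pairs = []
    for stmt, prems in premise_map.items():
        for p in prems:
            pairs.append((stmt, p, 1))
        negatives = [s for s in all_statements if s != stmt and s not in prems]
        n_neg = min(neg_per_pos * len(prems), len(negatives))
        for p in rng.sample(negatives, n_neg):
            pairs.append((stmt, p, 0))
    return pairs

premise_map = {"thm1": {"defA"}}
pairs = build_pairs(premise_map, ["thm1", "defA", "defB", "defC"])
print(len(pairs))  # 2  (one positive, one sampled negative)
```

Each pair would then be fed to a sequence-pair classifier such as the fine-tuned BERT models described below.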

For this experiment, we used the pre-trained BERT model bert-base-uncased and the SciBERT [4] model scibert-scivocab-uncased, fine-tuned for our task with a sequence classifier, adding a linear layer on top of the transformer vectors. The results are presented in Table 5. Even though BERT is not pre-trained on a mathematical corpus, it performs better than TF-IDF and PV-DBOW. SciBERT performs slightly better than BERT, since it was trained on a scientific (though not mathematical) corpus. This hints that a BERT model trained from scratch on a mathematical corpus could achieve even better results; however, this is outside the scope of this work.

Model MAP
SciBERT 0.383
BERT 0.377
Table 5: Results for BERT and SciBERT.

7 Conclusion

In this paper, we proposed a new task for mathematical language processing: natural language premise selection. We also made a new dataset available for the evaluation of the task and analysed how different baselines perform on it.

From our experiments, we identified that handling mathematical symbols is crucial for solving the task, taking into consideration the more specific semantics of operators and variables: such semantics are not captured by PV-DBOW or BERT. This provides evidence of the need for specific embeddings and representations for mathematical formulas and discourse, which could improve future results on the natural language premise selection task.

We also identify that the task becomes more challenging when we consider that the premises are transitive, suggesting that the task could benefit from graph-based representations.

Our dataset can be used in a variety of natural mathematical reasoning tasks, aiding researchers in creating mechanisms for improving the way machines understand mathematical text.

8 Acknowledgements

The authors would like to thank the anonymous reviewers for the constructive feedback.

9 Bibliographical References


  • [1] A. Aizawa, M. Kohlhase, I. Ounis, and M. Schubotz (2014) NTCIR-11 math-2 task overview.. In NTCIR, Vol. 11, pp. 88–98. Cited by: §1.
  • [2] J. Alama, T. Heskes, D. Kühlwein, E. Tsivtsivadze, and J. Urban (2014-02-01) Premise selection for mathematics by corpus analysis and kernel methods. Journal of Automated Reasoning 52 (2), pp. 191–213. External Links: ISSN 1573-0670, Document, Link Cited by: §2.
  • [3] J. Alama, T. Heskes, D. Kühlwein, E. Tsivtsivadze, and J. Urban (2014) Premise selection for mathematics by corpus analysis and kernel methods. Journal of Automated Reasoning 52 (2), pp. 191–213. Cited by: §2.
  • [4] I. Beltagy, K. Lo, and A. Cohan (2019) SciBERT: a pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3606–3611. Cited by: §6.
  • [5] J. C. Blanchette, C. Kaliszyk, L. C. Paulson, and J. Urban (2016) Hammering towards QED. Journal of Formalized Reasoning 9 (1), pp. 101–148. Cited by: §1.
  • [6] M. Cramer, B. Fisseni, P. Koepke, D. Kühlwein, B. Schröder, and J. Veldman (2009) The naproche project controlled natural language proof checking of mathematical texts. In International Workshop on Controlled Natural Language, pp. 170–186. Cited by: §2.
  • [7] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Cited by: §1, §6.
  • [8] M. Färber and C. Kaliszyk (2015) Random forests for premise selection. In Frontiers of Combining Systems, C. Lutz and S. Ranise (Eds.), Cham, pp. 325–340. External Links: ISBN 978-3-319-24246-0 Cited by: §2.
  • [9] M. Ganesalingam (2013) The language of mathematics. In The Language of Mathematics, pp. 17–38. Cited by: §1, §5.
  • [10] L. Gao, Z. Jiang, Y. Yin, K. Yuan, Z. Yan, and Z. Tang (2017) Preliminary Exploration of Formula Embedding for Mathematical Information Retrieval: can mathematical formulae be embedded like a natural language?. CIKM 2017 Workshop on Interpretable Data Mining (IDM). External Links: Link Cited by: §2.
  • [11] T. Gauthier and C. Kaliszyk (2015) Premise selection and external provers for hol4. In Proceedings of the 2015 Conference on Certified Programs and Proofs, CPP ’15, New York, NY, USA, pp. 49–57. External Links: ISBN 978-1-4503-3296-5, Link, Document Cited by: §2.
  • [12] A. Greiner-Petter, T. Ruas, M. Schubotz, A. Aizawa, W. Grosky, and B. Gipp (2019) Why Machines Cannot Learn Mathematics, Yet. 4th BIRNDL workshop at 42nd SIGIR. External Links: Link Cited by: §1.
  • [13] G. Irving, C. Szegedy, A. A. Alemi, N. Een, F. Chollet, and J. Urban (2016) DeepMath-deep sequence models for premise selection. In Advances in Neural Information Processing Systems, pp. 2235–2243. Cited by: §1, §2.
  • [14] K. Krstovski and D. M. Blei (2018) Equation Embeddings. External Links: Link Cited by: §2.
  • [15] Q. Le and T. Mikolov (2014) Distributed representations of sentences and documents. In International conference on machine learning, pp. 1188–1196. Cited by: §6.
  • [16] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119. Cited by: §2.
  • [17] D. Solow (2002) How to read and do proofs an introduction to mathematical thought processes. Cited by: §3.
  • [18] E. Wallace, Y. Wang, S. Li, S. Singh, and M. Gardner (2019) Do nlp models know numbers? probing numeracy in embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 5310–5318. Cited by: §6.
  • [19] S. Warner (1990) Modern algebra. Courier Corporation. Cited by: §1.