1 Introduction
Comprehending mathematical text requires evaluating the semantics of its mathematical structures (such as expressions) and connecting its internal components with the respective definitions or premises [12].
State-of-the-art models for natural language processing, such as BERT [7], achieve high scores on several tasks, such as entity recognition, textual entailment and machine translation, but they do not encode the intricate mathematical background knowledge needed to reason over mathematical discourse.
The language of mathematics is composed of a combination of words and symbols, where symbols follow a different set of rules and have a specific alphabet. Nonetheless, words and symbols are interdependent in the context of mathematical discourse. This phenomenon is exclusive to mathematical language, not found in any other natural or artificial language [9], providing a unique and challenging application for semantic evaluation and natural language processing.
Understanding mathematical discourse has been explored before as a Mathematical Knowledge Extraction task [1]; however, several aspects related to deeper and more granular reasoning over mathematical discourse have not yet been investigated. The literature also lacks the datasets needed for exploring and studying mathematical discourse and its associated interpretation and reasoning.
We propose the task of natural premise selection, inspired by the field of automated theorem proving. Premise selection appeared initially as the task of selecting a (useful) part of an extensive formal library in order to limit the search space for an Automated Theorem Proving (ATP) system, increasing the chance of finding a proof for a given conjecture [5]. Premises considered relevant are the ones that ATPs use in the automatic deduction process of finding a proof for a conjecture. The premise selection task is defined as: given a collection of premises , an ATP system with given resource limits, and a new conjecture , predict those premises from that will most likely lead to an automatically constructed proof of by [13].
Natural premise selection is based not on formally structured mathematics, but on human-generated mathematical text. It takes as input mathematical text written in natural language and outputs relevant mathematical statements that could support a human in finding a proof for that text. The premises are composed of a set of supporting definitions and supporting propositions that act as explanations for the proof process.
For example, the famous Fermat's Little Theorem [19] has different possible proofs, one of them using Euclid's Lemma. In this example, Euclid's Lemma would be considered useful for a human trying to prove Fermat's Little Theorem; therefore, it is a premise for the conjecture that Fermat's Little Theorem presents.
In order to evaluate this task, we propose a new dataset: NLPS (Natural Language Premise Selection), built from the human-curated website ProofWiki (https://proofwiki.org/wiki/Main_Page). This dataset opens possibilities of applications not only for the premise selection task but also for evaluating semantic representations for mathematical discourse (including embeddings), textual entailment for mathematics and natural language inference in the context of mathematical texts.
The contributions of this paper can be summarised as follows:
- Proposal of a new NLP task: natural language premise selection.
- A novel dataset, NLPS, to support the evaluation of premise selection methods using natural language corpora.
- Comparison of different baselines for the natural premise selection task.
2 Related Work
NLP has been applied before in the context of general mathematics. Chaganty and Liang (2016) propose a new task for semantic analysis, perspective generation, i.e., generating descriptions for numerical values using other values as reference. Huang et al. (2016) analyse different approaches to solving mathematical word problems and conclude that it is still an unsolved challenge.
Ganesalingam and Gowers (2017) propose a program that solves elementary mathematical problems, mainly in metric space theory, and presents solutions similar to the ones proposed by humans. The authors recognise that their system operates at a disadvantage because human language involves several constraints that rule out many sound and effective tactics for generating proofs.
Wang et al. (2018) propose an approach to automatically formalise informal mathematics using statistical parsing methods and large-theory automated reasoning. The idea is to convert an informal statement into a formal one, using Mizar as the output language. After the statement has been correctly translated, it can be checked using an automatic tool.
Naproche (Natural language Proof Checking) [6] is a project focused on the development of a controlled natural language (CNL) for mathematical texts and on adapting proof-checking software to work with this language in order to check syntactic and mathematical correctness.
Zinn (2003) proposes proof representation structures to represent mathematical discourse using discourse representation theory, and also proposes a prototype that could be used to automate the process of generating proofs.
Approaches for creating embeddings of mathematical text have applied variations of the Skip-gram model [16], extending it with a specific tokenisation strategy for equations and mathematical terms. Most tokenisation strategies use the tree structure of an equation to define the target tokens, and can range from considering the full equation [14] as a single token to decomposing it into its component expressions or individual symbols [10]. Greiner-Petter et al. (2019) developed a skip-gram-based model using as a reference corpus a set of arXiv papers in HTML format, with a term-level tokenisation granularity. The authors found that the induced vector space did not produce meaningful semantic clusters.
Premise selection is an approach generally used for selecting useful premises to prove conjectures in Automated Theorem Proving (ATP) systems [3]. Irving et al. (2016) propose a neural architecture for premise selection using formal statements written in Mizar. Other authors have used machine learning approaches such as kernel-based learning [2], the k-NN algorithm [11] and Random Forests [8]. However, the neural approaches previously presented [13] have obtained higher scores at the premise selection task.
3 Linguistic Considerations
In this section, we describe some of the linguistic features present in a mathematical corpus. Our aim is to examine mathematical discourse in combination with natural language. The following definitions are not of mathematical objects, since those already have established mathematical definitions; in this work, we are interested in how the different mathematical objects are presented inside the mathematical text.
Definition 1.
A mathematical expression , in a mathematical text, is defined by a set where , and is the set of symbols present in a certain mathematical domain of discourse, such as variables, constants and functions. A variable, for example, is considered an expression.
Definition 2.
An equation is defined as a combination of and an (in)equality predicate .
Definition 3.
A mathematical statement can be:
- A sequence of words (from the mathematical domain); or
- A sequence of words and expressions and/or equations; or
- A sequence of only equations.
Definition 4.
A mathematical text is a sequence of mathematical statements.
In mathematical text, words, expressions and equations can be directly related through a relationship of definiendum and definiens, where an expression (the definiendum) is defined by a mathematical statement or part of a mathematical statement (the definiens). The definiens is also used to determine the set of values and properties associated with an expression.
Definition 5.
A mathematical definiens is the set of tuples composed of and the set of (parts of) mathematical statements that declare and/or quantify in the mathematical text . Figure 1 presents an example where the definiens and the definiendum are highlighted. A definiendum can have more than one definiens; for example, the expression "" is declared by the equation "" and has the property "real function". Therefore:
Different mathematical texts can also be related, since mathematical knowledge is often incremental, where one element depends on others. For example, in Figure 1, in order to understand the meaning of the presented text, we need to understand the definition of a real function, which is defined in another mathematical text.
Definition 6.
A mathematical supporting definition is the set of mathematical texts , where all elements in contain a definition of a concept presented in . For example, the theorem in Figure 1 is connected to the mathematical text that defines what a real function is.
Definition 7.
A definition is composed of a 4-tuple , where is the definition text, is the set of categories that the definition belongs to, is the set of definiens in the text and is the set of definitions referenced in . If is empty, we call it an atomic definition.
A mathematical proof is a particular mathematical text that tries to convince the reader that a specific hypothesis can lead to a conclusion [17]. Proofs often contain mathematical bindings. They can also be connected to other propositions, such as lemmas, theorems and corollaries, as we will define next.
Definition 8.
A mathematical supporting proposition is the set of propositions that help support the argument proposed in the mathematical text of a proof. It is often used as an explanation for certain statements used in the construction of the proof. Figure 2 presents part of a proof, where the names of the supporting facts are highlighted. For example, the mathematical statement of Cauchy's Mean Theorem is a supporting fact for the proof shown.
Let .
Note that, from Power of Positive Real Number is Positive:
.
So:
(Exponent Combination Laws)
(Exponent Combination Laws)
(Cauchy's Mean Theorem)
Definition 9.
The set of premises of a mathematical text of a proof is the set of supporting facts and the set of supporting definitions , i.e., .
Definition 10.
A mathematical proof is composed of a tuple , where is the proof text and is the set of premises for .
Definition 11.
A theorem is composed of a tuple , where is the theorem's text, is the set of categories that the theorem belongs to, is the set of definiens in the text, and is the set of proofs for the theorem (one theorem can have more than one possible proof).
Definition 12.
Similarly, we can define a lemma . A lemma is composed of a 5-tuple , with the addition of , the theorem where the lemma occurs.
Definition 13.
A corollary is composed of a 5-tuple , where is the theorem that derives .
4 Dataset Construction: NLPS
In this section, we present our dataset, NLPS, and detail the steps we took in order to construct it. Our dataset is available as a set of JSON files at http://github.com/debymf/nlps. A summary of the process is presented in Figure 3.
Parsing the corpus
The proposed dataset was extracted from the source code of ProofWiki. ProofWiki is an online compendium of mathematical proofs, whose goal is to collect and classify mathematical proofs. ProofWiki contains links between theorems, definitions and axioms in the context of a mathematical proof, determining which dependencies are present. ProofWiki is manually curated by different collaborators; therefore, there are different styles of mathematical text, and many elements cannot be extracted automatically.
Cleaning wiki tags
ProofWiki uses MediaWiki tags; however, it also has specific tags related to the mathematical domain. Therefore, we cannot use default wiki extraction tools. A bespoke tool was developed to comply with ProofWiki's tagging scheme. For example, there is a particular tag for referring to another mathematical text, using passages from other texts in order to support a claim (Figure 4).
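As a rough sketch of what such a bespoke cleaning step involves, the snippet below strips MediaWiki-style links, templates and heading markers. The patterns are assumptions based on standard MediaWiki conventions; ProofWiki's actual tagging scheme is richer than this.

```python
import re

def clean_proofwiki_markup(text):
    """Strip common wiki markup, keeping the displayed text of internal links.

    The link/template syntax handled here follows generic MediaWiki
    conventions (an assumption for illustration); ProofWiki adds further
    domain-specific tags that a real cleaner must also handle.
    """
    # [[Target|displayed text]] -> displayed text
    text = re.sub(r"\[\[([^|\]]*)\|([^\]]*)\]\]", r"\2", text)
    # [[Target]] -> Target
    text = re.sub(r"\[\[([^\]]*)\]\]", r"\1", text)
    # Remove simple, non-nested {{templates}}
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # Turn "== Heading ==" lines into plain "Heading"
    text = re.sub(r"^=+\s*(.*?)\s*=+\s*$", r"\1", text, flags=re.MULTILINE)
    return text.strip()
```

A cleaner built this way can be extended with one substitution rule per ProofWiki-specific tag as those tags are catalogued.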
Proof curation
Several pages in ProofWiki are not directly related to mathematical propositions or definitions, such as user pages, help pages, and talk pages. We manually analysed the pages and removed the ones that are not definitions, lemmas, theorems or corollaries. Some pages also contained tags to indicate that
Extraction of categories
ProofWiki has associated categories for each page. However, the categories are not harmonised across definitions and propositions. We merged different categories that belonged to the same mathematical branch and selected the categories that contained at least 100 different entries. The categories selected are: Analysis, Set Theory, Number Theory, Abstract Algebra, Topology, Algebra, Relation Theory, Mapping Theory, Real Analysis, Geometry, Metric Spaces, Linear Algebra, Complex Analysis, Applied Mathematics, Order Theory, Numbers, Physics, Group Theory, Ring Theory, Euclidean Geometry, Class Theory, Discrete Mathematics, Plane Geometry and Units of Measurement.
Extracting supporting facts
The pages in ProofWiki are connected using hyperlinks. We leverage this structure to extract supporting propositions and supporting definitions. From the mathematical text of a definition, we extract the hyperlinks to other definitions; these links are the supporting definitions. From the mathematical text of proofs, we extract the hyperlinks to other propositions; we consider these the supporting propositions. For example, Figure 5 presents a theorem and its respective proof. The proof contains links (highlighted) to other propositions; these are the supporting propositions needed to support the proof.
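A minimal sketch of this link-extraction step, assuming the MediaWiki convention that definition pages live under a `Definition:` namespace prefix (an assumption for illustration; the actual page-naming scheme may differ):

```python
import re

def extract_supporting_links(wiki_text):
    """Collect internal link targets and split them into supporting
    definitions and supporting propositions.

    Assumes definition pages carry a 'Definition:' namespace prefix, as is
    common in MediaWiki sites; this is an illustrative assumption.
    """
    # Capture the target of [[Target]] or [[Target|displayed text]] links.
    targets = re.findall(r"\[\[([^|\]]+)(?:\|[^\]]*)?\]\]", wiki_text)
    definitions = [t for t in targets if t.startswith("Definition:")]
    propositions = [t for t in targets if not t.startswith("Definition:")]
    return definitions, propositions
```

Running the extractor over a proof page yields its supporting propositions; running it over a definition page yields its supporting definitions.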
Annotating mathematical text
The entries in ProofWiki are often divided into sections; for NLPS, we are only interested in the sections that present a definition, a proposition or a proof. Proofs were curated (combining manual and automatic annotation) to contain only mathematical discourse, removing satellite discourse such as historical notes. Because some propositions can be proved in different ways, we also annotated the different proofs which can be found inside a single page.
Structuring the entries
Finally, the dataset is structured as follows:
- Definition entries are composed of a mathematical text and a set of supporting definitions.
- Lemmas and theorems have a mathematical text, a proof and a set of premises.
- Corollaries are composed of a mathematical text, a proof, a set of premises and the theorem from which the corollary is derived.
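The structure above can be illustrated with hypothetical JSON entries; the field names below are our own, chosen for illustration, and do not necessarily match the schema of the released files:

```python
import json

# Illustrative entry layouts; field names are hypothetical and do not
# necessarily match the actual NLPS JSON schema.
theorem_entry = {
    "type": "theorem",
    "text": "For all positive real numbers ...",
    "categories": ["Analysis"],
    "proofs": [
        {
            "text": "Let ... Note that ...",
            "premises": {
                "supporting_propositions": ["Cauchy's Mean Theorem"],
                "supporting_definitions": ["Definition:Real Function"],
            },
        }
    ],
}

corollary_entry = {
    "type": "corollary",
    "text": "...",
    "proofs": [],  # same layout as the theorem's proofs
    "derived_from": "Name of the parent theorem",
}

# Entries round-trip through JSON, matching the dataset's file format.
serialised = json.dumps(theorem_entry, indent=2)
```

A corollary entry differs from a theorem entry only by the extra field linking it to its parent theorem, mirroring Definition 13.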
5 Dataset Analysis
NLPS has a total of 20,401 different entries, composed of definitions, lemmas, corollaries and theorems, as shown in Table 1.
Type         Number of entries
Definitions  5,633
Lemmas       327
Corollaries  292
Theorems     14,149
Total        20,401
Figure 6 presents the distribution of different categories in the dataset.
Figure 7 presents a histogram with the frequency of the different numbers of premises. We can observe that statements usually have a small number of premises, with most statements containing between one and five premises. The highest number of premises for one theorem is (the text for the theorem "The Sorgenfrey line is Lindelöf.").
Similarly, the histogram in Figure 8 shows the frequency of the different number of dependencies.
We also computed how many times each statement is used as a premise, and we observed that most statements are used as premises by only a small subset of other statements. A total of statements have between one and three dependants. On average, statements contain a total of symbols (characters and mathematical symbols). The specific number of tokens will depend on the type of tokenisation used for the mathematical symbols.
We can also represent the connections (premises) between different mathematical texts as a graph. This graph has a total of 14,393 nodes (the number of nodes is smaller than the number of entries, since some of the entries are disconnected, and we do not consider those for the graph) and 34,874 edges.
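The premise graph can be sketched with plain dictionaries over a toy set of statements; the statement names and links below are illustrative, not taken from the dataset:

```python
from collections import defaultdict

# Toy premise links (statement -> its premises); names are illustrative.
premises = {
    "Fermat's Little Theorem": ["Euclid's Lemma"],
    "Euclid's Lemma": ["Definition:Prime Number"],
    "Cauchy's Mean Theorem": ["Definition:Real Function"],
}

# Directed graph: one edge from a statement to each of its premises.
edges = [(s, p) for s, ps in premises.items() for p in ps]
nodes = {n for e in edges for n in e}

# How often each statement is used as a premise (its number of dependants).
dependants = defaultdict(int)
for _, p in edges:
    dependants[p] += 1
```

On the real dataset, the same construction yields the 14,393-node, 34,874-edge graph described above, and the dependant counts give the distribution discussed in the previous paragraph.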
The dataset provides a specific semantic modelling challenge for natural language processing, as it requires specific tokenisation, coreference resolution and the modelling of specific discourse structures tailored towards mathematical text. One crucial challenge is how to resolve the semantics of variables in mathematical expressions, which requires a particular binding method. As shown in Figure 9, variables that refer to the same set can often have different names. For example, in the definition of sine, the variable being used is , but and refer to the same set. Essentially, variables serve as a mathematical alternative to anaphora [9].
6 Experiments
In order to identify the challenges of the task of natural premise selection using NLPS, we performed initial experiments using two baselines: TF-IDF and PV-DBOW [15]. We use both techniques to create vector representations for all the mathematical texts, then compute the cosine similarity between each pair of entries and rank the results by proximity. We then compute the Mean Average Precision (MAP) for each baseline, ranking all possible premises, computed as:
MAP = (1/N) Σ_{i=1}^{N} AveP(t_i)

where N is the total number of documents, t_i is the i-th mathematical text and AveP is the average precision.
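The metric can be computed with a short routine; this is a minimal sketch, assuming each text's candidate premises are given as a ranked list and its true premises as a set:

```python
def average_precision(ranked, relevant):
    """AveP of one ranked premise list against the gold premise set."""
    hits, precision_sum = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(rankings, gold):
    """MAP over all mathematical texts: `rankings` maps each text to its
    ranked candidate premises, `gold` maps it to its true premise set."""
    return sum(average_precision(rankings[t], gold[t]) for t in rankings) / len(rankings)
```

For instance, a ranking that places one true premise first and the other third receives an AveP of (1/1 + 2/3) / 2 ≈ 0.83.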
Table 2 presents the initial results. We compare three different types of tokenisation for the mathematical elements. Initially, we treat expressions and equations as single tokens; for example, the expression "" would be considered a single word. We also considered tokenised expressions, tokenising operations and operators; the expression "" would be tokenised as ['', '', '', '', '']. Finally, we tokenise the whole text as a sequence of characters. We ran PV-DBOW with the default parameters, comparing different sizes of embeddings, with the best results obtained with an embedding size of 100.
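The three tokenisation strategies can be sketched as follows, using `x^2 + y = z` as an illustrative stand-in expression (the original symbols are not reproduced here):

```python
import re

equation = "x^2 + y = z"  # illustrative stand-in for a ProofWiki expression

# 1. Expression as a single word/token.
as_word = [equation]

# 2. Tokenised expression: split alphanumeric operands from operator symbols.
tokenised = re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", equation)

# 3. Character level: every non-space character is a token.
char_level = [c for c in equation if not c.isspace()]
```

For this stand-in, strategy 2 and strategy 3 happen to coincide; for multi-character operands (e.g. `x1` or `sin`) the character-level tokenisation splits further than the expression-level one.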
From these initial results, we can conclude that the task is semantically non-trivial and cannot be solved with retrieval strategies such as lexical overlap. We can also notice that we obtain better results when we tokenise the expressions, hinting that the elements inside the expressions have semantic properties that are relevant for determining the relevant premises. For the following experiments, we use the tokenised expressions and PV-DBOW with an embedding size of 100.
Tokenisation           TF-IDF  PV-DBOW (50)  PV-DBOW (100)  PV-DBOW (200)
Expression as words    0.073   0.048         0.051          0.046
Tokenised expressions  0.089   0.069         0.073          0.072
Char level             0.051   0.059         0.065          0.061
In Table 3 we compare the results for different sizes of the dataset. We consider the full dataset and three different subsets with different categories. We can notice that both baselines perform better on smaller datasets. This result was expected, since with smaller datasets there are fewer possible premises, and elements from the same categories tend to be more uniform among themselves.
Subset               TF-IDF  PV-DBOW
All Categories       0.089   0.076
Algebra (1,241)      0.183   0.177
Analysis (1,102)     0.191   0.212
Number Theory (741)  0.242   0.188
We can also consider the fact that premises are transitive, i.e., if a mathematical text has a premise , and has as a premise, then should also be a premise of . In this case, the task becomes even more challenging, as we present in Table 4, where we consider transitivity with two and three hops of distance. From the results, we notice that the more hops needed to obtain the premise, the worse our baselines perform.
                TF-IDF  PV-DBOW
1-hop premises  0.089   0.073
2-hop premises  0.052   0.047
3-hop premises  0.038   0.031
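The multi-hop (transitive) expansion of premises can be sketched as a bounded traversal over the premise links; the statement names below are placeholders:

```python
def premises_within_hops(statement, premises, max_hops):
    """Premises reachable from `statement` in at most `max_hops` hops,
    following the transitivity of the premise relation."""
    frontier, found = {statement}, set()
    for _ in range(max_hops):
        # One hop: replace the frontier by the premises of its members.
        frontier = {p for s in frontier for p in premises.get(s, ())}
        found |= frontier
    return found

# Toy chain of premise links: t1 -> t2 -> t3 -> t4 (names are placeholders).
premise_map = {"t1": ["t2"], "t2": ["t3"], "t3": ["t4"]}
```

With `max_hops=1` this reproduces the direct premise set; larger values generate the 2-hop and 3-hop gold sets used in Table 4.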
We also verify how state-of-the-art embedding models perform on such a specific dataset. BERT [7] is reported to perform well on different NLP tasks, including tasks probing numeracy [18].
In order to use BERT, we formulate the problem as a pairwise relevance classification problem, where we aim to classify whether one mathematical text is connected to another. We do not perform any preprocessing on the expressions.
For this experiment, we used the pre-trained BERT model bert-base-uncased and the SciBERT [4] model scibert-scivocab-uncased, fine-tuning for our task with a sequence classifier, adding a linear layer on top of the transformer vectors. The results are presented in Table 5. Even though BERT is not pre-trained on a mathematical corpus, it performs better than TF-IDF and PV-DBOW. SciBERT performs slightly better than BERT, since it was trained on a scientific corpus, although not a mathematical one. This hints that a BERT model trained from scratch on a mathematical corpus could obtain even better results; however, this is outside the scope of this work.
Model    MAP
SciBERT  0.383
BERT     0.377
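The pairwise relevance formulation can be sketched as the construction of labelled sequence pairs for a BERT-style classifier; the negative-sampling scheme shown is an assumption for illustration, not necessarily the exact procedure used in our experiments:

```python
import random

def build_pairwise_examples(statements, premise_map, neg_per_pos=1, seed=0):
    """Build (text_a, text_b, label) pairs for a sequence-pair classifier:
    label 1 when text_b is a premise of text_a, label 0 for a randomly
    sampled non-premise. The negative-sampling scheme is an illustrative
    assumption, not the paper's exact procedure."""
    rng = random.Random(seed)
    names = list(statements)
    examples = []
    for name, prems in premise_map.items():
        # Positive pairs: the statement with each of its true premises.
        for p in prems:
            examples.append((statements[name], statements[p], 1))
        # Negative pairs: sampled statements that are not premises.
        negatives = [n for n in names if n != name and n not in prems]
        k = min(neg_per_pos * len(prems), len(negatives))
        for n in rng.sample(negatives, k):
            examples.append((statements[name], statements[n], 0))
    return examples
```

Each pair can then be fed to the fine-tuned model as a standard sentence-pair classification input, with the linear layer predicting the relevance label.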
7 Conclusion
In this paper, we proposed a new task for mathematical language processing: natural language premise selection. We also made available a new dataset for the evaluation of the task, and we analysed the performance of different baselines on it.
From our experiments, we identified that handling mathematical symbols is crucial for solving the task, taking into consideration the more specific semantics of operators and variables: such semantics are not captured by PV-DBOW or BERT. This provides evidence of the need for specific embeddings and representations for mathematical formulas and discourse, which could substantially improve performance in future work on the natural language premise selection task.
We also identified that the task becomes more challenging when we consider that premises are transitive, suggesting that the task could benefit from graph-based representations.
Our dataset can be used in a variety of natural mathematical reasoning tasks, aiding researchers in the creation of mechanisms for improving the way machines understand mathematical text.
8 Acknowledgements
The authors would like to thank the anonymous reviewers for the constructive feedback.
9 Bibliographical References
 [1] (2014) NTCIR-11 Math-2 task overview. In NTCIR, Vol. 11, pp. 88–98. Cited by: §1.
 [2] (2014) Premise selection for mathematics by corpus analysis and kernel methods. Journal of Automated Reasoning 52 (2), pp. 191–213. External Links: ISSN 1573-0670, Document, Link. Cited by: §2.
 [3] (2014) Premise selection for mathematics by corpus analysis and kernel methods. Journal of Automated Reasoning 52 (2), pp. 191–213. Cited by: §2.
 [4] (2019) SciBERT: a pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3606–3611. Cited by: §6.
 [5] (2016) Hammering towards QED. Journal of Formalized Reasoning 9 (1), pp. 101–148. Cited by: §1.
 [6] (2009) The Naproche project: controlled natural language proof checking of mathematical texts. In International Workshop on Controlled Natural Language, pp. 170–186. Cited by: §2.
 [7] (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Cited by: §1, §6.
 [8] (2015) Random forests for premise selection. In Frontiers of Combining Systems, C. Lutz and S. Ranise (Eds.), Cham, pp. 325–340. External Links: ISBN 978-3-319-24246-0. Cited by: §2.
 [9] (2013) The language of mathematics. In The Language of Mathematics, pp. 17–38. Cited by: §1, §5.
 [10] (2017) Preliminary exploration of formula embedding for mathematical information retrieval: can mathematical formulae be embedded like a natural language? CIKM 2017 Workshop on Interpretable Data Mining (IDM). External Links: Link. Cited by: §2.
 [11] (2015) Premise selection and external provers for HOL4. In Proceedings of the 2015 Conference on Certified Programs and Proofs, CPP '15, New York, NY, USA, pp. 49–57. External Links: ISBN 978-1-4503-3296-5, Link, Document. Cited by: §2.
 [12] (2019) Why machines cannot learn mathematics, yet. 4th BIRNDL Workshop at the 42nd SIGIR. External Links: Link. Cited by: §1.
 [13] (2016) DeepMath - deep sequence models for premise selection. In Advances in Neural Information Processing Systems, pp. 2235–2243. Cited by: §1, §2.
 [14] (2018) Equation embeddings. External Links: Link. Cited by: §2.
 [15] (2014) Distributed representations of sentences and documents. In International Conference on Machine Learning, pp. 1188–1196. Cited by: §6.
 [16] (2013) Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119. Cited by: §2.
 [17] (2002) How to Read and Do Proofs: An Introduction to Mathematical Thought Processes. Cited by: §3.
 [18] (2019) Do NLP models know numbers? Probing numeracy in embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 5310–5318. Cited by: §6.
 [19] (1990) Modern Algebra. Courier Corporation. Cited by: §1.