A Survey in Mathematical Language Processing

by Jordan Meadows, et al.
Idiap Research Institute

Informal mathematical text underpins real-world quantitative reasoning and communication. Developing sophisticated methods of retrieval and abstraction from this dual modality is crucial in the pursuit of the vision of automating discovery in quantitative science and mathematics. We track the development of informal mathematical language processing approaches across five strategic sub-areas in recent years, highlighting the prevailing successful methodological elements along with existing limitations.



1 Introduction

Communicating quantitative science occurs through the medium of mathematical text, which contains expressions, formulae, and equations, most of which require accompanying description. Formulae and their explanations interweave with non-mathematical language to form cohesive discourse Meadows et al. (2022). Approaches that consider mathematical text have been proposed to solve a number of related tasks, but are yet to surpass human-level performance. Core areas include solving math word problems Hosseini et al. (2014), identifier-definition extraction and variable typing Pagael and Schubotz (2014); Stathopoulos et al. (2018), natural language premise selection Ferreira and Freitas (2020b), natural language theorem proving Welleck et al. (2021b), and formula retrieval Zanibbi et al. (2016b). While transformers Vaswani et al. (2017) have seen widespread success in many areas of language, only recently have they demonstrated mathematical Rabe et al. (2020) and logical Clark et al. (2020) capabilities, and they have since redefined state-of-the-art benchmarks in formula retrieval Peng et al. (2021) and solving math word problems Cobbe et al. (2021). Alongside transformers, graph neural networks (GNNs) also exhibit diverse reasoning capabilities with respect to mathematical language, including premise selection Ferreira and Freitas (2020b) and mathematical question answering, up to algebraic manipulation Feng et al. (2021). There is a clear evolutionary path in mathematical language processing, from roots in explicit discourse representation Zinn (2003); Cramer et al. (2009) to the present day, where graph-based and transformer-based models deliver leading metrics in a few related tasks Ferreira and Freitas (2021); Liang et al. (2021); Zhang et al. (2022), complemented by explicit methods in some cases Zhong and Zanibbi (2019); Mansouri et al. (2019); Alexeeva et al. (2020); Peng et al. (2021). This survey provides a synthesis of this evolutionary arc: in Section 2 we discuss research contributions leading to the current state-of-the-art for each task where applicable, ending each discussion with notable limitations of the strongest approaches. In Section 3 we conclude, discussing promising directions for future research involving informal mathematics.

2 Representative Areas

Research considering the link between mathematics and language has diversified since early work with math word problems Feigenbaum et al. (1963); Bobrow (1964); Charniak (1969) and discourse representation theory-based linguistic analysis of formal theorems Zinn (1999, 2003). Contemporary focal areas are driven by target textual interpretation and inference tasks, with a significant emphasis on the empirical evaluation of models. Examples of such areas include identifier-definition extraction, math information retrieval and formula search, natural language premise selection, math word problem solving, and informal theorem proving. We project these areas into an inference spectrum displayed in Figure 1.

Figure 1: Extractive tasks are closer to the lexical and surface-level expression of the text while abstractive tasks tend to require the integration of symbolic-level and abstract reasoning.
| Work | Task | Learning | Approach | Dataset | Metrics | Math Format | Key Representation |
|---|---|---|---|---|---|---|---|
| Identifier-Definition Extraction | | | | | | | |
| Kristianto et al. (2012) | Expression-definition | S | CRF with linguistic pattern features | LaTeX papers | P, R, F1 | MathML | Definition noun phrases |
| Kristianto et al. (2014a) | Expression-definition | S | SVM with linguistic pattern features | LaTeX papers | P, R, F1 | MathML | Definition noun phrases |
| Pagael and Schubotz (2014) | Identifier-definition | R | Gaussian heuristic ranking | Wikipedia articles | P@K, R@K | MathML | Id-def explicit templates |
| Schubotz et al. (2016a) | Identifier-definition | UNS | Gaussian ranking + K-means namespace clusters | NTCIR-11 Math Wikipedia | P, R, F1 | LaTeX | Namespace clusters |
| Schubotz et al. (2017) | Identifier-definition | S | G. rank + pattern matching + SVM | NTCIR-11 Math Wikipedia | P, R, F1 | LaTeX | Pattern matching SVM features |
| Stathopoulos et al. (2018) | Variable Typing | S | Link prediction with BiLSTM | arXiv papers | P, R, F1 | MathML | Type dictionary extended with DSTA |
| Alexeeva et al. (2020) | Identifier-definition | R | Odin grammar | MathAlign-Eval | P, R, F1 | LaTeX | LaTeX segmentation and alignment |
| Jo et al. (2021) | Notation auto-suggestion and consistency checking | S | BERT fine-tuning | S2ORC | Top1, Top5, MRR | LaTeX | LaTeX macro representations |
| Formula Retrieval | | | | | | | |
| Kristianto et al. (2014b) | NTCIR-11 Math-2 | S + R | SVM description extraction + leaf-root path search | NTCIR-11 Math-2 | P@5, P@10, MAP | MathML | MathML leaf-root paths |
| Kristianto et al. (2016) | NTCIR-12 MathIR | S + R | MCAT (2014) + multiple linear regression | NTCIR-12 MathIR | P@K | MathML | hash-based encoding and dependency graph |
| Zanibbi et al. (2016b) | NTCIR-11 Wikipedia Formula Retrieval | R | Inverted index ranking + MSS reranking search | NTCIR-11 Wikipedia | R@K, MRR | MathML | SLT leaf-root path tuples |
| Davila and Zanibbi (2017) | NTCIR-12 Wikipedia Formula Browsing | | Two-stage search for OPT and SLT merged with linear regression | NTCIR-12 Wikipedia Formula Browsing | P@K, Bpref, nDCG@K | LaTeX + MathML | SLT + OPT leaf-root path tuples |
| Zhong and Zanibbi (2019) | NTCIR-12 Wikipedia Formula Browsing | | OPT leaf-root path search with K largest subexpressions | NTCIR-12 Wikipedia Formula Browsing | P@K, Bpref | LaTeX | OPT leaf-root path tuples and subexpressions |
| Mansouri et al. (2019) | NTCIR-12 Wikipedia Formula Browsing | UNS | n-gram fastText OPT and SLT embeddings | NTCIR-12 Wikipedia Formula Browsing | Bpref@1000 | LaTeX + MathML | SLT and OPT leaf-root path tuple n-grams |
| Peng et al. (2021) | NTCIR-12 Wikipedia Formula Browsing + others | | pre-training BERT with tasks related to arXiv math-context pairs and OPTs | NTCIR-12 Wikipedia Formula Browsing | Bpref@1000 | LaTeX | LaTeX + natural language + OPT pre-trained BERT transformer encodings |
| Informal Premise Selection | | | | | | | |
| Ferreira and Freitas (2020b) | Natural Language Premise Selection | S | DGCNN for link prediction | PS-ProofWiki | P, R, F1 | LaTeX | Statement dependency graph |
| Ferreira and Freitas (2021) | Natural Language Premise Selection | | Self-attention for math and language + BiLSTM with Siamese Network | PS-ProofWiki | P, R, F1 | LaTeX | Cross-model encoding for math and NL |
| Coavoux and Cohen (2021) | Statement-Proof Matching | S | Weighted bipartite matching + self-attention | SPM | MRR + Acc | MathML | Self-attention encoding + bilinear similarity |
| Han et al. (2021) | Informal premise selection | S | LLM fine-tuning (webtext + webmath) | NaturalProofs | R@K, avgP@K, full@K | LaTeX | Transformer encodings |
| Welleck et al. (2021a) | Mathematical Reference Retrieval | S | Fine-tuning BERT with pair/joint parameterization | NaturalProofs | MAP, R@K, full@K | LaTeX | BERT encodings |
| Math Word Problem Solving | | | | | | | |
| Liu et al. (2019) | Math Word Problem Solving | S | BiLSTM seq encoder + LSTM tree-based decoder | Math23K | Acc | NL | Abstract Syntax Tree (AST) |
| Xie and Sun (2019) | Math Word Problem Solving | S | GRU encoder + GTS decoder | Math23K | Acc | NL | Goal-driven Tree Structure (GTS) |
| Li et al. (2020) | Math Word Problem Solving | S | (word-word graph + phrase structure graph) heterogeneous graph encoder + LSTM tree-based decoder | MAWPS, MATHQA | Acc | NL | Dependency parse tree + constituency tree |
| Zhang et al. (2020) | Math Word Problem Solving | S | Word-number graph encoder + Number-comp graph encoder + GTS decoder | MAWPS, Math23K | Acc | NL | Word-number graph + Number comp graph |
| Shen and Jin (2020) | Math Word Problem Solving | S | Seq multi-encoder + tree-based multi-decoder | Math23K | Acc | NL | Multi-encoding(/decoding) |
| Kim et al. (2020) | Math Word Problem Solving | S | ALBERT seq encoder + Transformer seq decoder | ALG514, DRAW-1K, MAWPS | Acc | NL | ALBERT encodings |
| Qin et al. (2020) | Math Word Problem Solving | S | Bi-GRU seq encoder + semantically-aligned GTS-based decoder | HMWP, ALG514, Math23K, Dolphin18K | Acc | NL | Universal Expression Tree (UET) |
| Cao et al. (2021) | Math Word Problem Solving | S | GRU-based encoder + DAG-LSTM decoder | DRAW-1K, Math23K | Acc | NL | Directed Acyclic Graph (DAG) |
| Lin et al. (2021) | Math Word Problem Solving | S | Hierarchical GRU seq encoder + GTS decoder | Math23K, MAWPS | Acc | NL | Hierarchical word-clause-problem encodings |
| Qin et al. (2021) | Math Word Problem Solving | S | Bi-GRU encoder + GTS decoder with attention and UET | Math23K, CM17K | Acc | NL | Representations from Auxiliary Tasks |
| Liang et al. (2021) | Math Word Problem Solving | S | BERT encoder + GTS decoder | Math23K, APE210K | Acc | NL | BERT encodings |
| Zhang et al. (2022) | Math Word Problem Solving | S | (word-word graph + word-number graph + number-comp graph) heterogeneous graph encoder + GTS decoder | MAWPS, Math23K | Acc | NL | word-word, word-num, num-comp het. gr. enc. |
| Informal Theorem Proving | | | | | | | |
| Kaliszyk et al. (2015b) | Autoformalisation | S | Informal symbol sentence parsing with probabilistic CFGs | HOL Light + Flyspeck | - | LaTeX to HOL/Flyspeck | HOL parse trees |
| Wang et al. (2020) | Autoformalisation | S + UNS | Machine translation with RNNs, LSTMs and Transformers | LaTeX, Mizar, TPTP, ProofWiki | BLEU, Perplexity, Edit distance | LaTeX, Mizar, TPTP | NMT, UNMT, XLM seq encodings |
| Meadows and Freitas (2021) | Equation Reconstruction, Automatic CAS Derivation | | Similarity-based search with string metrics and subexpression heuristics | PhysAI-368 | Acc | LaTeX | LaTeX subexpression |
| Welleck et al. (2021a) | Mathematical Reference Generation | S | Fine-tuning BERT with pair/joint parameterization | NaturalProofs | MAP | LaTeX | BERT encodings |
| Welleck et al. (2021b) | Next-step Suggestion, Full Proof Generation | | BART encoder with denoising pre-training and Fusion-in-Decoder | NaturalProofs | SBleu, Meteor, Edit, P, R, F1 | LaTeX | BART encodings pre-trained with denoising tasks |
Table 1: Summary of different methodologies for addressing tasks related to informal mathematical text. The methods are categorised in terms of (i) Task; (ii) Learning: Supervised (S), Self-supervised (SS), Unsupervised (UNS), Rule-based (R) (no learning); (iii) Approach; (iv) Dataset; (v) Metrics: MAP (Mean Average Precision), P@K (Precision at K), Perplexity, P (Precision), R (Recall), F1, Acc (Accuracy), BLEU, METEOR, MRR (Mean Reciprocal Rank), Edit (edit distance); (vi) Math format: MathML, LaTeX, natural language, or formal library (HOL, Mizar, Flyspeck, TPTP); (vii) Key representation of input text crucial to the approach.
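Several of the ranking metrics named in Table 1 have short standard definitions. The following is a minimal sketch of P@K, reciprocal rank, and average precision for a single query, assuming binary relevance judgements; the function names and the formula identifiers in the usage note are illustrative and not taken from any surveyed system. MRR and MAP are the means of the latter two quantities over all queries.

```python
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k ranked items that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for item in ranked[:k] if item in relevant) / k


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Mean of the precision values at each rank where a relevant item occurs."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant)


def mean(values: Sequence[float]) -> float:
    """Average over queries, used to obtain MRR or MAP."""
    return sum(values) / len(values) if values else 0.0
```

For a hypothetical ranked list `["f3", "f1", "f7", "f2"]` with relevant set `{"f1", "f2"}`, the first relevant item sits at rank 2, so the reciprocal rank is 1/2.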

2.1 Identifier-Definition Extraction

Figure 2: Taxonomy for approaches related to identifier-definition extraction. “Intra-doc” and “extra-doc” refer to how identifiers and definitions are scoped from supporting text.

A significant proportion of variables or identifiers in formulae or text are explicitly defined within a discourse context Wolska and Grigore (2010). Descriptions are usually local to the first instance of the identifiers in the discourse. It is the broad goal of identifier-definition extraction and related tasks to pair up identifiers with their counterpart descriptions.
The task has not converged to a canonical form. Despite the clarity of its overall aim, the task has materialised into different forms: Kristianto et al. (2012) predict descriptions given expressions, Pagael and Schubotz (2014) predict descriptions given identifiers through identifier-definition extraction, Stathopoulos et al. (2018) predict if a type matches a variable through variable typing, and Jo et al. (2021) predict notation given context through notation auto-suggestion and notation consistency checking tasks. More concretely, identifier-definition extraction Schubotz et al. (2016a) involves scoring identifier-definiens pairs, where a definiens is a potential natural language description of the identifier. Given graph nodes from predefined variables and types, variable typing Stathopoulos et al. (2018) is the task of classifying edges between them as either existent (positive) or non-existent (negative), where a positive classification means the variable matches the type. Notation auto-suggestion Jo et al. (2021) uses the text of both the sentence containing notation and the previous sentence to model future notation from the vocabulary of the tokenizer. The evolution of the overall area can be traced from an early ranking task Pagael and Schubotz (2014) reliant on heuristics and rules Alexeeva et al. (2020), through ML-based edge classification Stathopoulos et al. (2018), to language modelling with Transformers Jo et al. (2021). Different datasets are proposed for each task variant.
There is a high variability in scoping definitions. The scope from which identifiers are linked to descriptions varies significantly, and it is difficult to compare model performance even when tackling the same variant of the task Schubotz et al. (2017); Alexeeva et al. (2020). In a local context, models such as Pagael and Schubotz (2014) and Alexeeva et al. (2020) match identifiers with definitions from the same document “as the author intended”, while other identifier-definition extraction methods Schubotz et al. (2016a, 2017) rely on data external to a given document, such as links to semantic concepts on Wikidata and NTCIR-11 test data Schubotz et al. (2015). In a broader context, the variable typing model proposed in Stathopoulos et al. (2018) relies on an external dictionary of types Stathopoulos and Teufel (2015, 2016); Stathopoulos et al. (2018) extracted from both the Encyclopedia of Mathematics (https://encyclopediaofmath.org) and Wikipedia.
Vector representations have evolved to transfer knowledge from previous tasks, allowing downstream variable typing tasks to benefit from pre-trained natural language embeddings. Overall, vector representations of text have evolved from feature-based vectors learned from scratch for a single purpose, to the modern paradigm of pre-trained embeddings re-purposed for novel tasks. Kristianto et al. (2012) input pattern features into a conditional random fields model for the purpose of identifying definitions of expressions in LaTeX papers. Kristianto et al. (2014a) learn vectors through a linear-kernel SVM with input features comprising sentence patterns, part-of-speech (POS) tags, and tree structures. Stathopoulos et al. (2018) extend this approach by adding type- and variable-centric features as a baseline, also with a linear kernel. Alternatively, Schubotz et al. (2017) use a Gaussian scoring function Schubotz et al. (2016b) and pattern matching features Pagael and Schubotz (2014) as input to an SVM with a radial basis function (RBF) kernel, to account for non-linear feature characteristics. Alternative classification approaches Kristianto et al. (2012); Stathopoulos et al. (2018) do not use input features derived from non-linear functions, such as the Gaussian scoring function, and hence use linear kernels. Embedding spaces have been learned in this context for the purpose of ranking identifier-definiens pairs through latent semantic analysis at the document level, followed by the application of clustering techniques and methods of relating clusters to namespaces inherited from software engineering Schubotz et al. (2016a). These cluster-based namespaces are later used for classification Schubotz et al. (2017) rather than ranking, but do not positively impact SVM model performance, despite previous evidence suggesting they resolve co-references Duval et al. (2002) such as “E is energy” and “E is expectation value”. Neither clustering nor namespaces have been further explored in this context. While a more recent model learns context-specific word representations after feeding less specific pre-trained word2vec Mikolov et al. (2013); Stathopoulos and Teufel (2016) embeddings to a bidirectional LSTM for classification Stathopoulos et al. (2018), the most recent work predictably relies on more sophisticated pre-trained BERT embeddings Devlin et al. (2018) for the language modelling of mathematical notation Jo et al. (2021).
Identifier-definition extraction limitations. Methods considering the specific link between identifiers and their definitions have split off into at least three recent tasks: identifier-definition extraction Schubotz et al. (2017); Alexeeva et al. (2020), variable typing Stathopoulos et al. (2018), and notation auto-suggestion Jo et al. (2021). A lack of consensus on the framing of the task and data prevents a direct comparison between methods. Schubotz et al. (2017) advise against using their gold standard data for training due to certain extractions being too difficult for automated systems, among other reasons. They also propose future research should focus on recall due to current methods extracting exact definitions for only 1/3 of identifiers, and suggest use of multilingual semantic role labelling Akbik et al. (2016) and logical deduction Schubotz et al. (2016b). Logical deduction is partially tackled by Alexeeva et al. (2020), which is based on an open-domain causal IE system Sharp et al. (2019) with Odin grammar Valenzuela-Escárcega et al. (2016), where temporal logic is used to obtain intervals referred to by pre-identified time expressions Sharp et al. (2019). We assume the issues with superscript identifiers (such as Einstein notation etc.) from Schubotz et al. (2016b) carry over into Schubotz et al. (2017). The rule-based approach proposed by Alexeeva et al. (2020) attempts to account for such notation (known as wildcards in formula retrieval). They propose future methods should combine grammar with a learning framework, extend rule sets to account for coordinate constructions, and create well-annotated training data using tools such as PDFAlign and others Asakura et al. (2021).

2.2 Formula Retrieval

Figure 3: Taxonomy for approaches related to formula retrieval (math information retrieval). In the “SLT + OPT” box (top right), the asterisks in MCAT* and MathBERT* indicate that SLTs and/or OPTs are not encoded directly from trees as in the Tangent approaches or Approach0. MCAT encodes SLTs implicitly through consideration of MathML Presentation, and OPTs through MathML Content. MathBERT encodes OPT tree information but implicitly encodes SLT information through LaTeX formulae, similarly to MCAT. The number in the bottom right of the lower-most boxes is the harmonic mean of partial and full Bpref@1000.

We discuss approaches related to the NTCIR-11/12 MathIR Wikipedia Formula Retrieval/Browsing Tasks Zanibbi et al. (2016a). Similar to NTCIR-11, the NTCIR-12 MathIR Task objective is to build math information retrieval (MIR) systems that enable users to search for a particular math concept using math formulae. Given a query which contains a target formula expressed in MathML and several related keywords, each participating system in this task is expected to return a ranked list of the relevant retrieval units containing formulae matching the query Kristianto et al. (2016).
Combining formula tree representations improves retrieval. Two main tree representations of formulae exist: Symbol Layout Trees (SLTs) and Operator Trees (OPTs), shown in Figure 4.

Figure 4: Formula (a) with its Symbol Layout Tree (SLT) (b), and Operator Tree (OPT) (c). SLTs represent formula appearance by the spatial arrangements of math symbols, while OPTs define the mathematical operations represented in expressions.

Approaches reliant solely on SLTs, such as the early versions of the Tangent retrieval system Pattaniyil and Zanibbi (2014); Zanibbi et al. (2015, 2016b), or solely on OPTs Zhong and Zanibbi (2019); Zhong et al. (2020), tend to return less relevant formulae from queries. OPTs capture formula semantics while SLTs capture visual structure Mansouri et al. (2019). Effective representation of both formula layout and semantics within a single vector allows a model to exploit both representations. Tangent-S Davila and Zanibbi (2017) was the first evolution of the Tangent system to outperform the NTCIR-11 Aizawa et al. (2014) overall best performer, MCAT Kristianto et al. (2014b, 2016), which encoded path and sibling information from MathML Presentation (SLT-based) and Content (OPT-based). Tangent-S jointly integrated SLTs and OPTs by combining scores for each representation through a simple linear regressor. Later, Tangent-CFT Mansouri et al. (2019) considered SLTs and OPTs through a fastText Bojanowski et al. (2017) n-gram embedding model using tree tuples. MathBERT Peng et al. (2021) does not explicitly account for SLTs. The authors claim that LaTeX code accounts for SLTs to some extent and therefore focus on encoding OPTs. They pre-train the BERT Devlin et al. (2018) model with targeted objectives, each accounting for different aspects of mathematical text. They account for OPTs by concatenating node sequences to formula + context BERT input sequences, and by formulating OPT-based structure-aware pre-training tasks learned in conjunction with masked language modelling (MLM).
Leaf-root path tuples deliver an effective mechanism for embedding relations between symbol pairs. Leaf-root path tuples are now ubiquitous in formula retrieval Zanibbi et al. (2015, 2016b); Davila and Zanibbi (2017); Zhong and Zanibbi (2019); Mansouri et al. (2019); Zhong et al. (2020) and their use for NTCIR-11/12 retrieval has varied since their conception Stalnaker and Zanibbi (2015). Initially, pair tuples were used within a TF-IDF weighting scheme Pattaniyil and Zanibbi (2014); then Zanibbi et al. (2015, 2016b) proposed maximum subtree similarity (MSS), an appearance-based similarity metric using SLTs. OPT tuples were integrated later Davila and Zanibbi (2017). Mansouri et al. (2019) treat tree tuples as words, extract n-grams, and learn fastText Bojanowski et al. (2017) formula embeddings. Zhong and Zanibbi (2019); Zhong et al. (2020) forgo machine learning altogether with an OPT-based heuristic search (Approach0) through a generalisation of MSS Zanibbi et al. (2016b). Leaf-root path tuples effectively map symbol-pair relations and account for formula substructure, but there is dispute on how best to integrate them into existing machine learning or explicit retrieval frameworks. Well-developed similarity heuristics Zhong and Zanibbi (2019) and embedding techniques Mansouri et al. (2019) compete with one another, despite their complementarity.
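To make the idea of leaf-root paths concrete, the sketch below enumerates the label sequence from every leaf of a small operator tree up to its root, and compares two formulae by counting shared paths. This is a simplified illustration under stated assumptions: the nested-tuple tree encoding, the function names, and the overlap score are hypothetical, and the score is a toy stand-in rather than MSS or the tuple formats used by Tangent or Approach0.

```python
from collections import Counter
from typing import List, Tuple

# Assumed tree encoding: (label, [children]); a leaf has an empty child list.
Tree = Tuple[str, list]


def leaf_root_paths(tree: Tree, above: Tuple[str, ...] = ()) -> List[Tuple[str, ...]]:
    """Return one tuple per leaf, listing labels from the leaf up to the root."""
    label, children = tree
    path = (label,) + above  # current node first, then its ancestors
    if not children:
        return [path]
    paths: List[Tuple[str, ...]] = []
    for child in children:
        paths.extend(leaf_root_paths(child, path))
    return paths


def path_overlap(query: Tree, candidate: Tree) -> int:
    """Toy similarity: number of leaf-root paths shared by both trees (multiset)."""
    shared = Counter(leaf_root_paths(query)) & Counter(leaf_root_paths(candidate))
    return sum(shared.values())


# Operator trees for x^2 + y and x^2 + z.
query = ("+", [("^", [("x", []), ("2", [])]), ("y", [])])
candidate = ("+", [("^", [("x", []), ("2", [])]), ("z", [])])
```

Here the two trees share the paths through the exponentiation node and differ only in the final leaf, which is the kind of substructure match that path-based retrieval is designed to reward.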
Purely explicit methods still deliver competitive results. Tangent-CFT Mansouri et al. (2019) and MathBERT Peng et al. (2021) are two models that employ learning techniques beyond the level of linear regression. Each model is integrated with Approach0 Zhong and Zanibbi (2019) through the linear combination of individual model scores. These combinations respectively form the TanApp and MathApp baselines, the state-of-the-art in formula retrieval for non-wildcard queries. Of the individual models, Approach0 achieves the highest full Bpref score Peng et al. (2021), which highlights the power of explicit methods.
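A linear combination of two retrieval models' scores can be sketched as follows. This is only an illustration of the general idea: the weight, the treatment of missing candidates as score 0, and the example scores are assumptions, and the sketch does not reproduce how TanApp or MathApp normalise or weight their components.

```python
from typing import Dict, List


def combine_scores(
    scores_a: Dict[str, float],
    scores_b: Dict[str, float],
    weight_a: float = 0.5,
) -> Dict[str, float]:
    """Weighted sum of two score maps; a candidate missing from one map scores 0 there."""
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie in [0, 1]")
    candidates = set(scores_a) | set(scores_b)
    return {
        c: weight_a * scores_a.get(c, 0.0) + (1.0 - weight_a) * scores_b.get(c, 0.0)
        for c in candidates
    }


def rank(scores: Dict[str, float]) -> List[str]:
    """Candidates ordered by descending combined score (ties broken by identifier)."""
    return sorted(scores, key=lambda c: (-scores[c], c))


# Hypothetical scores from an embedding model and an explicit search model.
embedding_scores = {"f1": 0.9, "f2": 0.2}
explicit_scores = {"f1": 0.3, "f2": 0.8, "f3": 0.5}
combined = combine_scores(embedding_scores, explicit_scores, weight_a=0.5)
```

In practice the two score distributions would typically need to be brought onto a comparable scale before mixing, which this sketch leaves out.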
Formula retrieval limitations. Zhong and Zanibbi (2019) propose supporting query expansion of math synonyms to improve recall, and note that Approach0 does not support wildcard queries. Zhong et al. (2020) later provide basic support for wildcards. Tangent-CFT is also not evaluated on wildcard queries, and its authors suggest extending the test selection to include more diverse formulae, particularly those that are not present as exact matches. They also propose integrating nearby text into learned embeddings. MathBERT Peng et al. (2021) performs such integration, but does not learn n-gram embeddings, and is evaluated on non-wildcard queries only.

2.3 Informal Premise Selection

Formal and informal premise selection both involve the selection of relevant statements that are useful for proving a given conjecture Irving et al. (2016); Wang et al. (2017); Ferreira and Freitas (2020a). The difference lies in the language from which the premises and related proof elements are composed, and their compatibility with existing Automated Theorem Provers (ATPs). Informal language is not compatible with existing provers without autoformalisation Wang et al. (2020), which is a current bottleneck Irving et al. (2016). Typically, when reasoning over large formal libraries comprising thousands of premises, the performance of ATPs degrades considerably, while for a given proof only a fraction of the premises are required to complete it Urban et al. (2010); Alama et al. (2014). Theorem proving is essentially a search problem with a combinatorial search space, and the goal of formal premise selection is to reduce the space, making theorem proving tractable Wang et al. (2017). While formal premises are written in the languages of formal libraries such as Mizar Rudnicki (1992), informal premises (and theorems) as seen in ProofWiki (https://proofwiki.org/wiki/Main_Page) are written in combinations of natural language and LaTeX Ferreira and Freitas (2020a); Welleck et al. (2021a). Proposed approaches either rank Han et al. (2021) or classify Ferreira and Freitas (2020b, 2021) candidate premises for a given proof, detached from formal libraries and ATPs. Informal premise selection is a recently emerging field. Figure 1 describes it as a mid-spectrum task between retrieval and abstraction. Premise selection models select from existing text without explicitly reasoning beyond it. However, proficient models may be somewhat logical by proxy through the very nature of selecting premises for mathematical reasoning chains.
An example of informal premise selection is expressed through the natural language premise selection task, where, given a new conjecture that requires a mathematical proof and a collection (or knowledge base) of premises, the goal is to retrieve the premises most likely to be useful for proving the conjecture Ferreira and Freitas (2020a, b). This is formulated as a classification problem. Alternatively, Welleck et al. (2021a) propose mathematical reference retrieval as an analogue of premise selection. Given a theorem, the goal is to retrieve the set of references (theorems, lemmas, definitions) that occur in its proof, formulated as a ranking problem (retrieval).

Separate mechanisms for representing mathematics and natural language can improve performance. Regardless of the task variation, current approaches Ferreira and Freitas (2020b); Welleck et al. (2021a); Han et al. (2021); Coavoux and Cohen (2021) tend to jointly consider mathematics and language as a whole, not specifically accounting for aspects of each modality. Leading approaches for formula retrieval Peng et al. (2021); Mansouri et al. (2019) or solving math word problems Kim et al. (2020); Liang et al. (2021); Zhang et al. (2022) do not follow this trend. Ferreira and Freitas (2020b) extract a dependency graph representing dual-modality mathematical statements as nodes, and formulate the problem as link prediction Zhang and Chen (2018) similar to variable typing Stathopoulos et al. (2018). Other transformer-based or self-attentive baselines Ferreira and Freitas (2020b); Welleck et al. (2021a); Han et al. (2021); Coavoux and Cohen (2021) also do not separate mathematical elements from natural language. They consider notation with the same depth as word-level tokens and encode them similarly. Research in neuroscience Butterworth (2002); Amalric and Dehaene (2016) suggests the brain handles mathematics separately to natural language: approaches in premise selection Ferreira and Freitas (2021) and other tasks Peng et al. (2021); Zhang et al. (2022) have prospered from encoding mathematics through a separate mechanism to that of natural language. Ferreira and Freitas (2021) purposefully separate the two modalities, encoding each using self-attention and combining them with a bidirectional LSTM. Explicit disentanglement of the modalities forces the model to exploit latent relationships between language and mathematics through the LSTM layer.
Informal premise selection limitations. Limitations involve a lack of structural consideration of formulae and limited variable typing capabilities. Ferreira and Freitas (2020b) note that the graph-based approach to premise selection as link prediction struggles to encode mathematical statements which are mostly formulae, and suggest inclusion of structural embeddings (e.g. MathBERT Peng et al. (2021)) and training BERT on a mathematical corpus. They also describe value in formulating sophisticated heuristics for navigating the premises graph. Later, following a Siamese network architecture Ferreira and Freitas (2021) reliant on dual-layer word/expression self-attention and a BiLSTM (STAR), the authors demonstrate that STAR does not appropriately encode the semantics of variables. They suggest that variable typing and representation are a fundamental component of encoding mathematical statements. Han et al. (2021) plan to explore the effect of varying pre-training components, testing zero-shot performance without contrastive fine-tuning, and unsupervised retrieval. Coavoux and Cohen (2021) propose a statement-proof matching task akin to informal premise selection, with a solution reliant on a self-attentive encoder and bilinear similarity function. The authors note model confusion due to the proofs introducing new concepts and variables rather than referring to existing concepts.

2.4 Math Word Problems

Figure 5: Taxonomy for methods related to math word problem solving.

Solving math word problems dates back to the dawn of artificial intelligence Feigenbaum et al. (1963); Bobrow (1964); Charniak (1969). It can be defined as the task of translating a paragraph into a set of equations to be solved Li et al. (2020). We focus on trends in the task since 2019, as a detailed survey Zhang et al. (2019) captures prior work.
The use of dependency graphs is instrumental in supporting inference. In graph-based approaches to solving MWPs, embeddings of words, numbers, or relationship graph nodes are learned through graph encoders, which feed information through to tree (or sequence) decoders. Embeddings are decoded into expression trees which determine the problem solution. Li et al. (2020) learn the mapping between a heterogeneous graph representing the input problem and an output tree. The graph is constructed from word nodes together with the relationship nodes of a parsing tree, which is either a dependency parse tree or a constituency tree. Zhang et al. (2020) represent two separate graphs: a quantity cell graph associating descriptive words with problem quantities, and a quantity comparison graph which retains numerical qualities of the quantity, and leverages heuristics to represent relationships between quantities such that solution expressions reflect a more realistic arithmetic order. Shen and Jin (2020) also extract two graphs: a dependency parse tree and a numerical comparison graph. Zhang et al. (2022) construct a heterogeneous graph from three subgraphs: a word-word graph containing syntactic and semantic relationships between words, a number-word graph, and a number comparison graph. Their model is the best performing graph-based approach to date. Although other important differences exist (such as decoder choice), it seems that models explicitly relating multiple linguistic aspects of problem text tend to deliver better problem solving.
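As a rough illustration of a number comparison graph, the sketch below extracts the quantities from a problem text and adds a directed edge from each larger quantity to each smaller one. The edge direction, the regular expression, and the function names are assumptions made for this example; the surveyed models build such graphs as inputs to learned graph encoders, which is not shown here.

```python
import re
from typing import List, Tuple

# Assumed quantity pattern: unsigned integers or decimals written with digits.
NUMBER_PATTERN = re.compile(r"\d+(?:\.\d+)?")


def extract_quantities(problem: str) -> List[float]:
    """Return the numeric quantities in order of appearance in the text."""
    return [float(match) for match in NUMBER_PATTERN.findall(problem)]


def comparison_edges(quantities: List[float]) -> List[Tuple[int, int]]:
    """Directed edges (i, j) between quantity positions where quantity i > quantity j."""
    return [
        (i, j)
        for i, a in enumerate(quantities)
        for j, b in enumerate(quantities)
        if a > b
    ]


problem = "Tom has 5 apples and buys 12 more, then gives away 3."
quantities = extract_quantities(problem)
edges = comparison_edges(quantities)
```

Such edges give a decoder explicit access to the relative magnitude of quantities, for example to discourage subtracting a larger number from a smaller one.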
Multi-encoders and multi-decoders improve performance by combining complementary representations. Another impactful architectural decision is the choice of encoder/decoder. To highlight this, we consider the following comparison. Shen and Jin (2020) and Zhang et al. (2020) each extract two graphs from the problem text. One is a number comparison graph, and the other relates word-word pairs Shen and Jin (2020) or word-number pairs Zhang et al. (2020). They both encode two graphs rather than one heterogeneous graph Li et al. (2020); Zhang et al. (2022). They both use a similar tree-based decoder Xie and Sun (2019). A key difference is that Shen and Jin (2020) include an additional sequence-based encoder and decoder. The sequence-based encoder first obtains a textual representation of the input paragraph, then the graph-based encoder integrates the two encoded graphs. The tree-based and sequence-based decoders then generate different equation expressions for the problem, with an additional mechanism for optimising solution expression selection. In their own work, Shen and Jin (2020) demonstrate through ablation the impact of multi-encoders/decoders over each encoder/decoder option individually.
Goal-driven decompositional tree-based decoders are a significant component in the state-of-the-art. Introduced in Xie and Sun (2019) as GTS, this class of decoder is considered by all but three of the discussed models, as shown in Figure 5, and extends to non-graph-based models Qin et al. (2021); Liang et al. (2021). In GTS, goal vectors guide construction of expression subtrees (from token node embeddings) in a recursive manner, until a solution expression tree is generated. Proposed models do expand on the GTS-based decoder through the inclusion of semantically-aligned universal expression trees Qin et al. (2020, 2021), though this adaptation is not as widely used. The state-of-the-art approaches Liang et al. (2021); Zhang et al. (2022) follow the GTS decoder closely.
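The recursive decomposition described above can be sketched as follows. This is an illustration only: the learned prediction at each goal is replaced by a fixed token stream in prefix order, and the names `Node`, `decode`, and `evaluate`, as well as the placeholder quantities `n0`, `n1`, `n2`, are assumptions rather than parts of the GTS implementation.

```python
import operator
from dataclasses import dataclass
from typing import Iterator, Optional

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}


@dataclass
class Node:
    token: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def decode(tokens: Iterator[str]) -> Node:
    """Expand the current goal into a subtree.

    An operator splits the goal into a left and then a right sub-goal;
    a quantity token closes the goal as a leaf.
    """
    token = next(tokens)  # stand-in for the model's prediction at this goal
    if token in OPS:
        left = decode(tokens)
        right = decode(tokens)
        return Node(token, left, right)
    return Node(token)


def evaluate(node: Node, values: dict[str, float]) -> float:
    """Compute the answer from a finished expression tree."""
    if node.token in OPS:
        return OPS[node.token](evaluate(node.left, values), evaluate(node.right, values))
    return values[node.token]


# Prefix order for the expression (n0 - n1) + n2
tree = decode(iter(["+", "-", "n0", "n1", "n2"]))
answer = evaluate(tree, {"n0": 12.0, "n1": 5.0, "n2": 3.0})
print(answer)
```

In a neural decoder, each call to `decode` would condition on a goal vector and on the subtree already built to its left, rather than reading the next token from a fixed list.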
Language models that transfer knowledge learned from auxiliary tasks rival models based on explicit graph representation of problem text. As an alternative to encoding explicit relations through graphs, other work Kim et al. (2020); Qin et al. (2021); Liang et al. (2021) relies on pre-trained transformer-based models, and on models which incorporate auxiliary tasks assumed relevant for solving MWPs, to latently learn such relations. However, it appears that auxiliary tasks alone do not deliver competitive performance Qin et al. (2020) without extensive pre-training on large corpora, as we see with BERT-based transformer models. These use either an ALBERT Lan et al. (2019) encoder and decoder Kim et al. (2020), or a BERT-based encoder with a goal-driven tree-based decoder Liang et al. (2021).
Math word problem limitations. For Graph2Tree-Z, Zhang et al. (2020) suggest considering more complex relations between quantities and language, and introducing heuristics to improve solution expression generation from the tree-based decoder. For EPT, Kim et al. (2020) find that the error probability related to fragmentation issues increases exponentially with the number of unknowns, and propose generalising EPT to other MWP datasets. For HGEN, Zhang et al. (2022) note three areas of future improvement: combining models into a unified framework through ensembling multiple encoders (similar to Ferreira and Freitas (2021)); integrating external knowledge sources (e.g. HowNet Dong and Dong (2003), Cilin Hong-Minh and Smith (2008)); and real-world dataset development for unsupervised or weakly supervised approaches Qin et al. (2020).

2.5 Informal Theorem Proving

Formal automated theorem proving in logic is among the most advanced and abstract forms of reasoning materialised in the AI space. There are two major bottlenecks Irving et al. (2016) formal methods must overcome: (1) translating informal mathematical text into formal language (autoformalisation), and (2) a lack of strong automated reasoning methods to fill in the gaps in already formalised human-written proofs. Informal methods either tackle autoformalisation directly Wang et al. (2020); Wu et al. (2022), or circumvent it through language modelling-based proof generation Welleck et al. (2021a, b), trading formal rigour for flexibility. Transformer-based models have been proposed for mathematical reasoning Polu and Sutskever (2020); Rabe et al. (2020); Wu et al. (2021). Converting informal mathematical text into forms interpretable by computers Kaliszyk et al. (2015a, b); Szegedy (2020); Wang and Deng (2020); Meadows and Freitas (2021) is closer to the real-world reasoning and communication format followed by mathematicians.
Autoformalisation could be addressed through approximate translation and exploration rather than direct machine translation. A long-studied and extremely challenging endeavour Zinn (1999, 2003), autoformalisation involves converting informal mathematical text into language interpretable by theorem provers Kaliszyk et al. (2015b); Wang et al. (2020); Szegedy (2020). Kaliszyk et al. (2015b) propose a statistical learning approach for parsing ambiguous formulae over the Flyspeck formal mathematical corpus Hales (2006). Later, thanks to improved machine translation capabilities Luong et al. (2017); Lample et al. (2018); Lample and Conneau (2019), Wang et al. (2020) explore dataset translation experiments between LaTeX code extracted from ProofWiki, and formal libraries Mizar Rudnicki (1992) and TPTP Sutcliffe and Suttner (1998). The supervised RNN-based neural machine translation model Luong et al. (2017) outperforms the transformer-based Lample et al. (2018) and MLM pre-trained transformer-based Lample and Conneau (2019) models, with the performance boost stemming from its use of alignment data. Szegedy (2020) advises against such direct translation efforts, instead proposing a combination of exploration and approximate translation through predicting formula embeddings. In seq2seq models, embeddings are typically granular, encoding word-level or symbol-level Jo et al. (2021) tokens. The suggestion is to learn mappings from natural language input to premise statements near the desired statement in the embedding space, traversing the space between statements using a suitable prover Bansal et al. (2019). Guided mathematical exploration for real-world proofs is still an unaddressed problem and does not scale well with step-distance between current and desired formulae. It may be easier to continue with direct translation Wang et al. (2020). For example, Wu et al. (2022) report promising results, directly autoformalising small competition problems to Isabelle statements using language models. Similar to the earlier suggestion of Szegedy (2020), they also autoformalise statements as targets for proof search with a neural theorem prover.
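The approximate translation idea, mapping informal input to formal statements that lie near the desired statement in an embedding space, can be illustrated with a nearest-neighbour lookup. The sketch below is a toy under stated assumptions: the three formal statements, their hand-picked three-dimensional embeddings, and the "predicted" query embedding are all hypothetical, and a real system would obtain embeddings from a trained model and hand the retrieved statements to a prover for exploration.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Hypothetical formal statements with toy embeddings (assumed, not learned)
formal_statements = {
    "add_comm: a + b = b + a": [0.9, 0.1, 0.0],
    "mul_comm: a * b = b * a": [0.1, 0.9, 0.0],
    "add_zero: a + 0 = a": [0.7, 0.0, 0.3],
}


def nearest_statements(query: list[float], k: int = 2) -> list[str]:
    """Rank formal statements by similarity to the predicted embedding."""
    ranked = sorted(
        formal_statements,
        key=lambda name: cosine(query, formal_statements[name]),
        reverse=True,
    )
    return ranked[:k]


# Stand-in for the embedding predicted from "addition is commutative"
predicted = [0.8, 0.2, 0.1]
candidates = nearest_statements(predicted)
print(candidates)
```

The retrieved candidates would serve as starting points for exploration, which is the step the text identifies as scaling poorly with the distance between current and desired formulae.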
Need for developing robust interactive natural language theorem provers. We discuss the closest equivalent to formal theorem proving in an informal setting. Welleck et al. (2021a) propose a mathematical reference generation task. Given a mathematical claim, the order and number of references within a proof are predicted. A reference is a theorem, definition, or a page that is linked to within the contents of a statement or proof. Each theorem $x$ has a proof containing a sequence of references $\mathbf{r} = (r_1, \dots, r_n)$, for references $r_i \in \mathcal{R}$. Where the retrieval task assigns a score to each reference in $\mathcal{R}$, the generation task produces a variable-length sequence of references $\hat{\mathbf{r}}$ with the goal of matching $\mathbf{r}$, for which a BERT-based model is employed and fine-tuned on various data sources. Welleck et al. (2021b) expand on their proof generation work, proposing two related tasks: next-step suggestion, where a step $y_t$ from a proof (as described above) is defined as a sequence of tokens to be generated, given the previous steps $y_{<t}$ and the claim $x$; and full-proof generation, which extends this task to generate the full proof. They employ BART Lewis et al. (2019), an encoder-decoder model pre-trained with denoising tasks, and augment the model with reference knowledge using Fusion-in-Decoder Izacard and Grave (2020). The intermediate denoising training and knowledge-grounding improve model performance by producing better representations of (denoised) references for deployment at generation time, and by encoding reference-augmented inputs. Aiming towards automatic physics derivation, Meadows and Freitas (2021) propose an equation reconstruction task similar to next-step suggestion, where, given a sequence of LaTeX strings $(s_1, s_2, s_3)$ from a computer algebra physics derivation, the intermediate string $s_2$ is removed, and must be re-derived. The similarity-based heuristic search selects two consecutive computer algebra operations from a knowledge base and sequentially applies them to $s_1$, in order to derive the known equation $s_3$. If $s_3$ is obtained, then the equation after the first operation is taken as $s_2$, and a partial derivation is achieved.
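The equation reconstruction task can be sketched with a toy search. The code below is not the similarity-based heuristic of Meadows and Freitas (2021): it swaps in plain exhaustive enumeration over ordered pairs of operations, and it assumes a hypothetical knowledge base of three operations acting on equations whose right-hand side is a polynomial in x (coefficients stored constant term first) and whose left-hand side is a string recording the operations applied.

```python
from itertools import product

# An equation is (lhs_label, rhs_coefficients); (0, 0, 1) encodes x^2.


def differentiate(eq):
    lhs, rhs = eq
    derived = tuple(i * c for i, c in enumerate(rhs))[1:] or (0,)
    return (f"d/dx({lhs})", derived)


def times_x(eq):
    lhs, rhs = eq
    return (f"x*({lhs})", (0,) + rhs)


def plus_one(eq):
    lhs, rhs = eq
    return (f"({lhs})+1", (rhs[0] + 1,) + rhs[1:])


# Hypothetical knowledge base of computer algebra operations
OPERATIONS = {"differentiate": differentiate, "times_x": times_x, "plus_one": plus_one}


def reconstruct(first, last):
    """Find two consecutive operations taking `first` to `last`.

    Returns the intermediate equation and the operation names, or None.
    """
    for name_a, name_b in product(OPERATIONS, repeat=2):
        middle = OPERATIONS[name_a](first)
        if OPERATIONS[name_b](middle) == last:
            return middle, (name_a, name_b)
    return None


first = ("f", (0, 0, 1))             # f = x^2
last = ("x*(d/dx(f))", (0, 0, 2))    # x * df/dx = 2x^2
result = reconstruct(first, last)
print(result)
```

Exhaustive enumeration grows quadratically with the size of the knowledge base for two-step gaps, which is one reason a heuristic for ranking candidate operations is attractive.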
Informal theorem proving limitations. Wang et al. (2020) suggest the development of high-quality datasets for evaluating translation models, including structural formula representations, and jointly embedding multiple proof assistant libraries to increase formal dataset size. Szegedy (2020) argues that reasoning systems based on self-driven exploration without informal communication capabilities would suffer usage and evaluation difficulties. Wu et al. (2022) note limitations with text window size and difficulty storing large formal theories with current language models. After proposing the NaturalProofs dataset, Welleck et al. (2021a) characterise error types for the full-proof generation and next-step suggestion tasks, noting issues with: (1) hallucinated references, meaning the reference does not occur in NaturalProofs; (2) non-ground-truth references, meaning the reference does not occur in the ground-truth proof; (3) undefined terms; (4) improper or irrelevant statements, meaning a statement that is mathematically invalid or irrelevant to the proof; and (5) statements that do not follow logically from the preceding statements. Dealing with research-level physics, Meadows and Freitas (2021) note that the cost of semi-automated formalisation is significant and does not scale well, requiring detailed expert-level manual intervention. They also call for a set of well-defined computer algebra operations such that robust mathematical exploration can be guided in a goal-based setting.

3 Conclusion

In this work we deliver a synthesis of the recent evolutionary arc for strategic areas in mathematical language processing. We systematically describe the methods, challenges and trends within each area, eliciting consolidated modelling components and emerging methodological advances. In areas related to variable typing and formula retrieval, explicit methods compete with and complement embedding models. In word problem solving involving simpler mathematics, dependency graphs explicitly represent relationships between numerical tokens and language. Models either encode graph input or sequence input, and decode to solution expression trees via recursive goal-driven tree decoders. Research with multi-encoders/decoders suggests value in combining representations. For advanced mathematics, language-based premise selection models also use graph-based and transformer-based models, mostly learning formulae and language embeddings without integrating formula structure or variable typing. Limited autoformalisation of informal mathematics exists through machine translation, but it is elsewhere argued that approximate translation to related premises followed by exploration is more promising. Some circumvent formal libraries altogether through flexible proof generation, physics derivation, and premise selection in less formal environments. We hope future techniques will benefit from this synthesis.


  • A. Aizawa, M. Kohlhase, I. Ounis, and M. Schubotz (2014) NTCIR-11 math-2 task overview. In NTCIR, Vol. 11, pp. 88–98. Cited by: §2.2.
  • A. Akbik, X. Guan, and Y. Li (2016) Multilingual aliasing for auto-generating proposition banks. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 3466–3474. Cited by: §2.1.
  • J. Alama, T. Heskes, D. Kühlwein, E. Tsivtsivadze, and J. Urban (2014) Premise selection for mathematics by corpus analysis and kernel methods. Journal of Automated Reasoning 52 (2), pp. 191–213. Cited by: §2.3.
  • M. Alexeeva, R. Sharp, M. A. Valenzuela-Escárcega, J. Kadowaki, A. Pyarelal, and C. Morrison (2020) Mathalign: linking formula identifiers to their contextual natural language descriptions. In Proceedings of The 12th Language Resources and Evaluation Conference, pp. 2204–2212. Cited by: §1, §2.1, Table 1.
  • M. Amalric and S. Dehaene (2016) Origins of the brain networks for advanced mathematics in expert mathematicians. Proceedings of the National Academy of Sciences 113 (18), pp. 4909–4917. Cited by: §2.3.
  • T. Asakura, Y. Miyao, A. Aizawa, and M. Kohlhase (2021) MioGatto: a math identifier-oriented grounding annotation tool. Technical report EasyChair. Cited by: §2.1.
  • K. Bansal, S. M. Loos, M. N. Rabe, C. Szegedy, and S. Wilcox (2019) Holist: an environment for machine learning of higher-order theorem proving (extended version). arXiv preprint arXiv:1904.03241. Cited by: §2.5.
  • D. G. Bobrow (1964) Natural language input for a computer problem solving system. Cited by: §2.4, §2.
  • P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov (2017) Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5, pp. 135–146. Cited by: §2.2.
  • B. Butterworth (2002) Mathematics and the brain. Opening address to the Mathematical Association, Reading. Cited by: §2.3.
  • Y. Cao, F. Hong, H. Li, and P. Luo (2021) A bottom-up dag structure extraction model for math word problems. In Thirty-Fifth AAAI Conference on Artificial Intelligence, 2021, pp. 39–46. Cited by: Table 1.
  • E. Charniak (1969) Computer solution of calculus word problems. In Proceedings of the 1st international joint conference on Artificial intelligence, pp. 303–316. Cited by: §2.4, §2.
  • P. Clark, O. Tafjord, and K. Richardson (2020) Transformers as soft reasoners over language. arXiv preprint arXiv:2002.05867. Cited by: §1.
  • M. Coavoux and S. B. Cohen (2021) Learning to match mathematical statements with proofs. arXiv preprint arXiv:2102.02110. Cited by: §2.3, Table 1.
  • K. Cobbe, V. Kosaraju, M. Bavarian, J. Hilton, R. Nakano, C. Hesse, and J. Schulman (2021) Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168. Cited by: §1.
  • M. Cramer, B. Fisseni, P. Koepke, D. Kühlwein, B. Schröder, and J. Veldman (2009) The naproche project controlled natural language proof checking of mathematical texts. In International Workshop on Controlled Natural Language, pp. 170–186. Cited by: §1.
  • K. Davila and R. Zanibbi (2017) Layout and semantics: combining representations for mathematical formula search. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1165–1168. Cited by: §2.2, Table 1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §2.1.
  • Z. Dong and Q. Dong (2003) HowNet-a hybrid language and knowledge resource. In International Conference on Natural Language Processing and Knowledge Engineering, 2003. Proceedings. 2003, pp. 820–824. Cited by: §2.4.
  • E. Duval, W. Hodgins, S. Sutton, and S. L. Weibel (2002) Metadata principles and practicalities. D-lib Magazine 8 (4), pp. 1–10. Cited by: §2.1.
  • E. A. Feigenbaum, J. Feldman, et al. (1963) Computers and thought. New York McGraw-Hill. Cited by: §2.4, §2.
  • W. Feng, B. Liu, D. Xu, Q. Zheng, and Y. Xu (2021) GraphMR: graph neural network for mathematical reasoning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 3395–3404. Cited by: §1.
  • D. Ferreira and A. Freitas (2020a) Natural language premise selection: finding supporting statements for mathematical text. arXiv preprint arXiv:2004.14959. Cited by: §2.3.
  • D. Ferreira and A. Freitas (2020b) Premise selection in natural language mathematical texts. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7365–7374. Cited by: §1, §2.3, §2.3, Table 1.
  • D. Ferreira and A. Freitas (2021) STAR: cross-modal [sta]tement [r]epresentation for selecting relevant mathematical premises. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 3234–3243. Cited by: §1, §2.3, §2.3, §2.4, Table 1.
  • T. C. Hales (2006) Introduction to the flyspeck project. In Dagstuhl Seminar Proceedings, Cited by: §2.5.
  • J. M. Han, T. Xu, S. Polu, A. Neelakantan, and A. Radford (2021) Contrastive finetuning of generative language models for informal premise selection. Cited by: §2.3, §2.3, Table 1.
  • T. Hong-Minh and D. Smith (2008) Word similarity in wordnet. In Modeling, Simulation and Optimization of Complex Processes, pp. 293–302. Cited by: §2.4.
  • M. J. Hosseini, H. Hajishirzi, O. Etzioni, and N. Kushman (2014) Learning to solve arithmetic word problems with verb categorization. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 523–533. Cited by: §1.
  • G. Irving, C. Szegedy, A. A. Alemi, N. Eén, F. Chollet, and J. Urban (2016) Deepmath-deep sequence models for premise selection. In Advances in Neural Information Processing Systems, pp. 2235–2243. Cited by: §2.3, §2.5.
  • G. Izacard and E. Grave (2020) Leveraging passage retrieval with generative models for open domain question answering. arXiv preprint arXiv:2007.01282. Cited by: §2.5.
  • H. Jo, D. Kang, A. Head, and M. A. Hearst (2021) Modeling mathematical notation semantics in academic papers. In Findings of the Association for Computational Linguistics: EMNLP 2021, pp. 3102–3115. Cited by: §2.1, §2.5, Table 1.
  • C. Kaliszyk, J. Urban, U. Siddique, S. Khan-Afshar, C. Dunchev, and S. Tahar (2015a) Formalizing physics: automation, presentation and foundation issues. In International Conference on Intelligent Computer Mathematics, pp. 288–295. Cited by: §2.5.
  • C. Kaliszyk, J. Urban, and J. Vyskočil (2015b) Learning to parse on aligned corpora (rough diamond). In International Conference on Interactive Theorem Proving, pp. 227–233. Cited by: §2.5, Table 1.
  • B. Kim, K. S. Ki, D. Lee, and G. Gweon (2020) Point to the expression: solving algebraic word problems using the expression-pointer transformer model. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 3768–3779. Cited by: §2.3, §2.4, Table 1.
  • G. Y. Kristianto, A. Aizawa, et al. (2014a) Extracting textual descriptions of mathematical expressions in scientific papers. D-Lib Magazine 20 (11), pp. 9. Cited by: §2.1, Table 1.
  • G. Y. Kristianto, M. Nghiem, Y. Matsubayashi, and A. Aizawa (2012) Extracting definitions of mathematical expressions in scientific papers. In Proc. of the 26th Annual Conference of JSAI, pp. 1–7. Cited by: §2.1, Table 1.
  • G. Y. Kristianto, G. Topic, and A. Aizawa (2016) MCAT math retrieval system for ntcir-12 mathir task. In NTCIR, Cited by: §2.2, §2.2, Table 1.
  • G. Y. Kristianto, G. Topic, F. Ho, and A. Aizawa (2014b) The mcat math retrieval system for ntcir-11 math track. In NTCIR, Cited by: §2.2, Table 1.
  • G. Lample and A. Conneau (2019) Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291. Cited by: §2.5.
  • G. Lample, M. Ott, A. Conneau, L. Denoyer, and M. Ranzato (2018) Phrase-based & neural unsupervised machine translation. arXiv preprint arXiv:1804.07755. Cited by: §2.5.
  • Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut (2019) Albert: a lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942. Cited by: §2.4.
  • M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer (2019) Bart: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461. Cited by: §2.5.
  • S. Li, L. Wu, S. Feng, F. Xu, F. Xu, and S. Zhong (2020) Graph-to-tree neural networks for learning structured input-output translation with applications to semantic parsing and math word problem. arXiv preprint arXiv:2004.13781. Cited by: §2.4, Table 1.
  • Z. Liang, J. Zhang, J. Shao, and X. Zhang (2021) Mwp-bert: a strong baseline for math word problems. arXiv preprint arXiv:2107.13435. Cited by: §1, §2.3, §2.4, Table 1.
  • X. Lin, Z. Huang, H. Zhao, E. Chen, Q. Liu, H. Wang, and S. Wang (2021) Hms: a hierarchical solver with dependency-enhanced understanding for math word problem. In Thirty-Fifth AAAI Conference on Artificial Intelligence, 2021, pp. 4232–4240. Cited by: Table 1.
  • Q. Liu, W. Guan, S. Li, and D. Kawahara (2019) Tree-structured decoding for solving math word problems. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pp. 2370–2379. Cited by: Table 1.
  • M. Luong, E. Brevdo, and R. Zhao (2017) Neural machine translation (seq2seq) tutorial. Cited by: §2.5.
  • B. Mansouri, S. Rohatgi, D. W. Oard, J. Wu, C. L. Giles, and R. Zanibbi (2019) Tangent-cft: an embedding model for mathematical formulas. In Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval, pp. 11–18. Cited by: §1, §2.2, §2.3, Table 1.
  • J. Meadows and A. Freitas (2021) Similarity-based equational inference in physics. Physical Review Research 3 (4), pp. L042010. Cited by: §2.5, Table 1.
  • J. Meadows, Z. Zhou, and A. Freitas (2022) PhysNLU: a language resource for evaluating natural language understanding and explanation coherence in physics. arXiv preprint arXiv:2201.04275. Cited by: §1.
  • T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. Cited by: §2.1.
  • R. Pagael and M. Schubotz (2014) Mathematical language processing project. arXiv preprint arXiv:1407.0167. Cited by: §1, §2.1, Table 1.
  • N. Pattaniyil and R. Zanibbi (2014) Combining tf-idf text retrieval with an inverted index over symbol pairs in math expressions: the tangent math search engine at ntcir 2014. In NTCIR, Cited by: §2.2.
  • S. Peng, K. Yuan, L. Gao, and Z. Tang (2021) Mathbert: a pre-trained model for mathematical formula understanding. arXiv preprint arXiv:2105.00377. Cited by: §1, §2.2, §2.3, Table 1.
  • S. Polu and I. Sutskever (2020) Generative language modeling for automated theorem proving. arXiv preprint arXiv:2009.03393. Cited by: §2.5.
  • J. Qin, X. Liang, Y. Hong, J. Tang, and L. Lin (2021) Neural-symbolic solver for math word problems with auxiliary tasks. arXiv preprint arXiv:2107.01431. Cited by: §2.4, Table 1.
  • J. Qin, L. Lin, X. Liang, R. Zhang, and L. Lin (2020) Semantically-aligned universal tree-structured solver for math word problems. arXiv preprint arXiv:2010.06823. Cited by: §2.4, Table 1.
  • M. N. Rabe, D. Lee, K. Bansal, and C. Szegedy (2020) Mathematical reasoning via self-supervised skip-tree training. arXiv preprint arXiv:2006.04757. Cited by: §1, §2.5.
  • P. Rudnicki (1992) An overview of the mizar project. In Proceedings of the 1992 Workshop on Types for Proofs and Programs, pp. 311–330. Cited by: §2.3, §2.5.
  • M. Schubotz, A. Grigorev, M. Leich, H. S. Cohl, N. Meuschke, B. Gipp, A. S. Youssef, and V. Markl (2016a) Semantification of identifiers in mathematics for better math information retrieval. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, pp. 135–144. Cited by: §2.1, Table 1.
  • M. Schubotz, L. Krämer, N. Meuschke, F. Hamborg, and B. Gipp (2017) Evaluating and improving the extraction of mathematical identifier definitions. In International Conference of the Cross-Language Evaluation Forum for European Languages, pp. 82–94. Cited by: §2.1, Table 1.
  • M. Schubotz, D. Veenhuis, and H. S. Cohl (2016b) Getting the units right. In FM4M/MathUI/ThEdu/DP/WIP@ CIKM, pp. 146–156. Cited by: §2.1.
  • M. Schubotz, A. Youssef, V. Markl, and H. S. Cohl (2015) Challenges of mathematical information retrieval in the ntcir-11 math wikipedia task. In Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval, pp. 951–954. Cited by: §2.1.
  • R. Sharp, A. Pyarelal, B. Gyori, K. Alcock, E. Laparra, M. A. Valenzuela-Escárcega, A. Nagesh, V. Yadav, J. Bachman, Z. Tang, et al. (2019) Eidos, indra, & delphi: from free text to executable causal models. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), Cited by: §2.1.
  • Y. Shen and C. Jin (2020) Solving math word problems with multi-encoders and multi-decoders. In Proceedings of the 28th International Conference on Computational Linguistics, pp. 2924–2934. Cited by: §2.4, Table 1.
  • D. Stalnaker and R. Zanibbi (2015) Math expression retrieval using an inverted index over symbol pairs. In Document recognition and retrieval XXII, Vol. 9402, pp. 34–45. Cited by: §2.2.
  • Y. Stathopoulos, S. Baker, M. Rei, and S. Teufel (2018) Variable typing: assigning meaning to variables in mathematical text. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 303–312. Cited by: §1, §2.1, §2.3, Table 1.
  • Y. Stathopoulos and S. Teufel (2015) Retrieval of research-level mathematical information needs: a test collection and technical terminology experiment. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pp. 334–340. Cited by: §2.1.
  • Y. Stathopoulos and S. Teufel (2016) Mathematical information retrieval based on type embeddings and query expansion. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 2344–2355. Cited by: §2.1.
  • G. Sutcliffe and C. Suttner (1998) The tptp problem library. Journal of Automated Reasoning 21 (2), pp. 177–203. Cited by: §2.5.
  • C. Szegedy (2020) A promising path towards autoformalization and general artificial intelligence. In International Conference on Intelligent Computer Mathematics, pp. 3–20. Cited by: §2.5.
  • J. Urban, K. Hoder, and A. Voronkov (2010) Evaluation of automated theorem proving on the mizar mathematical library. In International Congress on Mathematical Software, pp. 155–166. Cited by: §2.3.
  • M. A. Valenzuela-Escárcega, G. Hahn-Powell, and M. Surdeanu (2016) Odin’s runes: a rule language for information extraction. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pp. 322–329. Cited by: §2.1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. arXiv preprint arXiv:1706.03762. Cited by: §1, §2.2.
  • M. Wang and J. Deng (2020) Learning to prove theorems by learning to generate theorems. arXiv preprint arXiv:2002.07019. Cited by: §2.5.
  • M. Wang, Y. Tang, J. Wang, and J. Deng (2017) Premise selection for theorem proving by deep graph embedding. arXiv preprint arXiv:1709.09994. Cited by: §2.3.
  • Q. Wang, C. Brown, C. Kaliszyk, and J. Urban (2020) Exploration of neural machine translation in autoformalization of mathematics in mizar. In Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs, pp. 85–98. Cited by: §2.3, §2.5, Table 1.
  • S. Welleck, J. Liu, R. L. Bras, H. Hajishirzi, Y. Choi, and K. Cho (2021a) Naturalproofs: mathematical theorem proving in natural language. arXiv preprint arXiv:2104.01112. Cited by: §2.3, §2.3, §2.5, Table 1.
  • S. Welleck, J. Liu, J. M. Han, and Y. Choi (2021b) Towards grounded natural language proof generation. In MathAI4Ed Workshop at NeurIPS, Cited by: §1, §2.5, Table 1.
  • M. Wolska and M. Grigore (2010) Symbol declarations in mathematical writing. Cited by: §2.1.
  • Y. Wu, A. Q. Jiang, W. Li, M. N. Rabe, C. Staats, M. Jamnik, and C. Szegedy (2022) Autoformalization with large language models. arXiv. Cited by: §2.5.
  • Y. Wu, M. N. Rabe, W. Li, J. Ba, R. B. Grosse, and C. Szegedy (2021) Lime: learning inductive bias for primitives of mathematical reasoning. In International Conference on Machine Learning, pp. 11251–11262. Cited by: §2.5.
  • Z. Xie and S. Sun (2019) A goal-driven tree-structured neural model for math word problems. In IJCAI, pp. 5299–5305. Cited by: §2.4, Table 1.
  • R. Zanibbi, A. Aizawa, M. Kohlhase, I. Ounis, G. Topic, and K. Davila (2016a) NTCIR-12 mathir task overview. In NTCIR, Cited by: §2.2.
  • R. Zanibbi, K. Davila, A. Kane, and F. W. Tompa (2016b) Multi-stage math formula search: using appearance-based similarity metrics at scale. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, pp. 145–154. Cited by: §1, §2.2, Table 1.
  • R. Zanibbi, K. Davila, A. Kane, and F. Tompa (2015) The tangent search engine: improved similarity metrics and scalability for math formula search. arXiv preprint arXiv:1507.06235. Cited by: §2.2.
  • D. Zhang, L. Wang, L. Zhang, B. T. Dai, and H. T. Shen (2019) The gap of semantic parsing: a survey on automatic math word problem solvers. IEEE transactions on pattern analysis and machine intelligence 42 (9), pp. 2287–2305. Cited by: §2.4.
  • J. Zhang, L. Wang, R. K. Lee, Y. Bin, Y. Wang, J. Shao, and E. Lim (2020) Graph-to-tree learning for solving math word problems. Cited by: §2.4, Table 1.
  • M. Zhang and Y. Chen (2018) Link prediction based on graph neural networks. arXiv preprint arXiv:1802.09691. Cited by: §2.3.
  • Y. Zhang, G. Zhou, Z. Xie, and J. X. Huang (2022) HGEN: learning hierarchical heterogeneous graph encoding for math word problem solving. IEEE/ACM Transactions on Audio, Speech, and Language Processing. Cited by: §1, §2.3, §2.4, Table 1.
  • W. Zhong, S. Rohatgi, J. Wu, C. L. Giles, and R. Zanibbi (2020) Accelerating substructure similarity search for formula retrieval. In European Conference on Information Retrieval, pp. 714–727. Cited by: §2.2.
  • W. Zhong and R. Zanibbi (2019) Structural similarity search for formulas using leaf-root paths in operator subtrees. In European Conference on Information Retrieval, pp. 116–129. Cited by: §1, §2.2, Table 1.
  • C. Zinn (1999) Understanding mathematical discourse. In Proceedings of the Workshop on the Semantics and Pragmatics of Dialogue, Amsterdam University, Cited by: §2.5, §2.
  • C. Zinn (2003) A computational framework for understanding mathematical discourse. Logic Journal of IGPL 11 (4), pp. 457–484. Cited by: §1, §2.5, §2.