1 Introduction
Quantitative science is communicated through the medium of mathematical text, which contains expressions, formulae, and equations, most of which require accompanying description. Formulae and their explanations interweave with non-mathematical language to form cohesive discourse Meadows et al. (2022). Approaches that consider mathematical text have been proposed to solve a number of related tasks, but are yet to surpass human-level performance. Core areas include solving math word problems Hosseini et al. (2014), identifier-definition extraction and variable typing Pagael and Schubotz (2014); Stathopoulos et al. (2018), natural language premise selection Ferreira and Freitas (2020b), natural language theorem proving Welleck et al. (2021b), and formula retrieval Zanibbi et al. (2016b). While transformers Vaswani et al. (2017) have seen widespread success in many areas of language, only recently have they demonstrated mathematical Rabe et al. (2020) and logical Clark et al. (2020) capabilities, since redefining state-of-the-art benchmarks in formula retrieval Peng et al. (2021) and solving math word problems Cobbe et al. (2021). Alongside transformers, graph neural networks (GNNs) also exhibit diverse reasoning capabilities with respect to mathematical language, including premise selection Ferreira and Freitas (2020b) and mathematical question answering, up to algebraic manipulation Feng et al. (2021). There is a clear evolutionary path in mathematical language processing, from roots in explicit discourse representation Zinn (2003); Cramer et al. (2009) to the present day, where graph-based and transformer-based models deliver leading metrics in a few related tasks Ferreira and Freitas (2021); Liang et al. (2021); Zhang et al. (2022), complemented by explicit methods in some cases Zhong and Zanibbi (2019); Mansouri et al. (2019); Alexeeva et al. (2020); Peng et al. (2021). This survey provides a synthesis of this evolutionary arc: in Section 2 we discuss research contributions leading to the current state of the art for each task where applicable, ending each discussion with notable limitations of the strongest approaches. In Section 3 we conclude, discussing promising directions for future research involving informal mathematics.
2 Representative Areas
Research considering the link between mathematics and language has diversified since early work with math word problems Feigenbaum et al. (1963); Bobrow (1964); Charniak (1969) and discourse representation theory-based linguistic analysis of formal theorems Zinn (1999, 2003). Contemporary focal areas are driven by target textual interpretation and inference tasks, with a significant emphasis on the empirical evaluation of models. Examples of such areas include identifier-definition extraction, math information retrieval and formula search, natural language premise selection, math word problem solving, and informal theorem proving. We project these areas into an inference spectrum displayed in Figure 1.
| Work | Task | Learning | Approach | Dataset | Metrics | Math Format | Key Representation |
|---|---|---|---|---|---|---|---|
| Identifier-Definition Extraction | | | | | | | |
| Kristianto et al. (2012) | Expression-definition | S | CRF with linguistic pattern features | LaTeX papers | P, R, F1 | MathML | Definition noun phrases |
| Kristianto et al. (2014a) | Expression-definition | S | SVM with linguistic pattern features | LaTeX papers | P, R, F1 | MathML | Definition noun phrases |
| Pagael and Schubotz (2014) | Identifier-definition | R | Gaussian heuristic ranking | Wikipedia articles | P@K, R@K | MathML | Id-def explicit templates |
| Schubotz et al. (2016a) | Identifier-definition | UNS | Gaussian ranking + K-means namespace clusters | NTCIR-11 Math Wikipedia | P, R, F1 | LaTeX | Namespace clusters |
| Schubotz et al. (2017) | Identifier-definition | S | G. rank + pattern matching + SVM | NTCIR-11 Math Wikipedia | P, R, F1 | LaTeX | Pattern matching SVM features |
| Stathopoulos et al. (2018) | Variable Typing | S | Link prediction with BiLSTM | arXiv papers | P, R, F1 | MathML | Type dictionary extended with DSTA |
| Alexeeva et al. (2020) | Identifier-definition | R | Odin grammar | MathAlignEval | P, R, F1 | LaTeX | LaTeX segmentation and alignment |
| Jo et al. (2021) | | S | BERT fine-tuning | S2ORC | Top-1, Top-5, MRR | LaTeX | LaTeX macro representations |
| Formula Retrieval | | | | | | | |
| Kristianto et al. (2014b) | NTCIR-11 Math-2 | S + R | SVM description extraction + leaf-root path search | NTCIR-11 Math-2 | P@5, P@10, MAP | MathML | MathML leaf-root paths |
| Kristianto et al. (2016) | NTCIR-12 MathIR | S + R | MCAT (2014) + multiple linear regression | NTCIR-12 MathIR | P@K | MathML | Hash-based encoding and dependency graph |
| Zanibbi et al. (2016b) | | R | Inverted index ranking + MSS reranking search | NTCIR-11 Wikipedia | R@K, MRR | MathML | SLT leaf-root path tuples |
| Davila and Zanibbi (2017) | | S | | | P@K, Bpref, nDCG@K | LaTeX + MathML | SLT + OPT leaf-root path tuples |
| Zhong and Zanibbi (2019) | | R | | | P@K, Bpref | LaTeX | OPT leaf-root path tuples and subexpressions |
| Mansouri et al. (2019) | | UNS | n-gram fastText OPT and SLT embeddings | | Bpref@1000 | LaTeX + MathML | SLT and OPT leaf-root path tuple n-grams |
| Peng et al. (2021) | | SS | | | Bpref@1000 | LaTeX | |
| Informal Premise Selection | | | | | | | |
| Ferreira and Freitas (2020b) | | S | DGCNN for link prediction | PS-ProofWiki | P, R, F1 | LaTeX | Statement dependency graph |
| Ferreira and Freitas (2021) | | S | | PS-ProofWiki | P, R, F1 | LaTeX | Cross-model encoding for math and NL |
| Coavoux and Cohen (2021) | Statement-Proof Matching | S | Weighted bipartite matching + self-attention | SPM | MRR + Acc | MathML | Self-attention encoding + bilinear similarity |
| Han et al. (2021) | Informal premise selection | S | LLM fine-tuning (webtext + webmath) | NaturalProofs | R@K, avgP@K, full@K | LaTeX | Transformer encodings |
| Welleck et al. (2021a) | Mathematical Reference Retrieval | S | Fine-tuning BERT with pair/joint parameterization | NaturalProofs | MAP, R@K, full@K | LaTeX | BERT encodings |
| Math Word Problem Solving | | | | | | | |
| Liu et al. (2019) | Math Word Problem Solving | S | BiLSTM seq encoder + LSTM tree-based decoder | Math23K | Acc | NL | Abstract Syntax Tree (AST) |
| Xie and Sun (2019) | Math Word Problem Solving | S | GRU encoder + GTS decoder | Math23K | Acc | NL | Goal-driven Tree Structure (GTS) |
| Li et al. (2020) | Math Word Problem Solving | S | | MAWPS, MATHQA | Acc | NL | Dependency parse tree + constituency tree |
| Zhang et al. (2020) | Math Word Problem Solving | S | | MAWPS, Math23K | Acc | NL | Word-number graph + Number comp graph |
| Shen and Jin (2020) | Math Word Problem Solving | S | Seq multi-encoder + tree-based multi-decoder | Math23K | Acc | NL | Multi-encoding(/decoding) |
| Kim et al. (2020) | Math Word Problem Solving | S | ALBERT seq encoder + Transformer seq decoder | ALG514, DRAW1K, MAWPS | Acc | NL | ALBERT encodings |
| Qin et al. (2020) | Math Word Problem Solving | S | | HMWP, ALG514, Math23K, Dolphin18K | Acc | NL | Universal Expression Tree (UET) |
| Cao et al. (2021) | Math Word Problem Solving | S | GRU-based encoder + DAG-LSTM decoder | DRAW1K, Math23K | Acc | NL | Directed Acyclic Graph (DAG) |
| Lin et al. (2021) | Math Word Problem Solving | S | Hierarchical GRU seq encoder + GTS decoder | Math23K, MAWPS | Acc | NL | Hierarchical word-clause-problem encodings |
| Qin et al. (2021) | Math Word Problem Solving | S | BiGRU encoder + GTS decoder with attention and UET | Math23K, CM17K | Acc | NL | Representations from Auxiliary Tasks |
| Liang et al. (2021) | Math Word Problem Solving | S | BERT encoder + GTS decoder | Math23K, APE210K | Acc | NL | BERT encodings |
| Zhang et al. (2022) | Math Word Problem Solving | S | | MAWPS, Math23K | Acc | NL | Word-word, word-num, num-comp het. gr. enc. |
| Informal Theorem Proving | | | | | | | |
| Kaliszyk et al. (2015b) | Autoformalisation | S | Informal symbol sentence parsing with probabilistic CFGs | HOL Light + Flyspeck | | LaTeX to HOL/Flyspeck | HOL parse trees |
| Wang et al. (2020) | Autoformalisation | S + UNS | Machine translation with RNNs, LSTMs and Transformers | LaTeX, Mizar, TPTP, ProofWiki | | LaTeX, Mizar, TPTP | NMT, UNMT, XLM seq encodings |
| Meadows and Freitas (2021) | | R | | PhysAI368 | Acc | LaTeX | LaTeX subexpression |
| Welleck et al. (2021a) | Mathematical Reference Generation | S | Fine-tuning BERT with pair/joint parameterization | NaturalProofs | MAP | LaTeX | BERT encodings |
| Welleck et al. (2021b) | | S | | NaturalProofs | SBleu, Meteor, Edit, P, R, F1 | LaTeX | BART encodings pretrained with denoising tasks |
2.1 Identifier-Definition Extraction
A significant proportion of variables or identifiers in formulae or text are explicitly defined within a discourse context Wolska and Grigore (2010). Descriptions are usually local to the first instance of the identifiers in the discourse. It is the broad goal of identifier-definition extraction and related tasks to pair up identifiers with their counterpart descriptions.
The task has not converged to a canonical form. Despite the clarity of its overall aim, the task has materialised into different forms: Kristianto et al. (2012) predict descriptions given expressions, Pagael and Schubotz (2014) predict descriptions given identifiers through identifier-definition extraction, Stathopoulos et al. (2018) predict if a type matches a variable through variable typing, and Jo et al. (2021) predict notation given context through notation auto-suggestion and notation consistency checking tasks. More concretely, identifier-definition extraction Schubotz et al. (2016a) involves scoring identifier-definiens pairs, where a definiens is a potential natural language description of the identifier. Given graph nodes drawn from predefined sets of variables and types, variable typing Stathopoulos et al. (2018) is the task of classifying whether each edge between a variable and a type is existent (positive) or nonexistent (negative), where a positive classification means the variable matches the type. Notation auto-suggestion Jo et al. (2021) uses the text of both the sentence containing notation and the previous sentence to model future notation from the vocabulary of the tokenizer. The evolution of the overall area can be traced from an early ranking task Pagael and Schubotz (2014) reliant on heuristics and rules Alexeeva et al. (2020), through ML-based edge classification Stathopoulos et al. (2018), to language modelling with Transformers Jo et al. (2021). Different datasets are proposed for each task variant.
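To make the idea of scoring identifier-definiens pairs concrete, the following is a minimal sketch that ranks candidate definiens by their token distance to an identifier, using a Gaussian decay. The function names, the width `sigma`, and the use of token distance as the only signal are illustrative assumptions. This is not the scoring function of Pagael and Schubotz (2014) or Schubotz et al. (2016a).

```python
import math
from typing import List, Tuple


def gaussian_score(distance: int, sigma: float = 3.0) -> float:
    """Map a token distance to a score in (0, 1]; closer tokens score higher.

    `sigma` is a hypothetical width parameter chosen for illustration.
    """
    return math.exp(-(distance ** 2) / (2 * sigma ** 2))


def rank_definiens(
    tokens: List[str], identifier: str, candidates: List[str], sigma: float = 3.0
) -> List[Tuple[str, float]]:
    """Rank candidate definiens by their closest distance to the identifier."""
    id_positions = [i for i, tok in enumerate(tokens) if tok == identifier]
    scored = []
    for cand in candidates:
        cand_positions = [i for i, tok in enumerate(tokens) if tok == cand]
        if not id_positions or not cand_positions:
            continue  # skip candidates or identifiers absent from the text
        distance = min(abs(i - j) for i in id_positions for j in cand_positions)
        scored.append((cand, gaussian_score(distance, sigma)))
    # Highest score first
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


tokens = "the energy E of a particle with mass m".split()
ranking = rank_definiens(tokens, "E", ["energy", "particle", "mass"])
print(ranking)
```

In this toy sentence the nearest noun is ranked first. A realistic system would combine several signals and restrict candidates to noun phrases.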
There is a high variability in scoping definitions. The scope from which identifiers are linked to descriptions varies significantly, and it is difficult to compare model performance even when tackling the same variant of the task Schubotz et al. (2017); Alexeeva et al. (2020). At a local context, models such as Pagael and Schubotz (2014) and Alexeeva et al. (2020) match identifiers with definitions from the same document “as the author intended”, while other identifier-definition extraction methods Schubotz et al. (2016a, 2017) rely on data external to a given document, such as links to semantic concepts on Wikidata and NTCIR-11 test data Schubotz et al. (2015). At a broader context, the variable typing model proposed in Stathopoulos et al. (2018) relies on an external dictionary of types Stathopoulos and Teufel (2015, 2016); Stathopoulos et al. (2018) extracted from both the Encyclopedia of Mathematics (https://encyclopediaofmath.org) and Wikipedia.
Vector representations have evolved to transfer knowledge from previous tasks, allowing downstream variable typing tasks to benefit from pretrained natural language embeddings.
Overall, vector representations of text have evolved from feature-based vectors learned from scratch for a single purpose, to the modern paradigm of pretrained embeddings repurposed for novel tasks.
Kristianto et al. (2012) input pattern features into a conditional random fields model for the purpose of identifying definitions of expressions in LaTeX papers. Kristianto et al. (2014a) learn vectors through a linear-kernel SVM with input features comprising sentence patterns, part-of-speech (POS) tags, and tree structures. Stathopoulos et al. (2018) extend this approach by adding type-centric and variable-centric features as a baseline, also with a linear kernel. Alternatively, Schubotz et al. (2017) use a Gaussian scoring function Schubotz et al. (2016b) and pattern matching features Pagael and Schubotz (2014) as input to an SVM with a radial basis function (RBF) kernel, to account for nonlinear feature characteristics. Alternative classification approaches Kristianto et al. (2012); Stathopoulos et al. (2018) do not use input features derived from nonlinear functions, such as the Gaussian scoring function, and hence use linear kernels. Embedding spaces have been learned in this context for the purpose of ranking identifier-definiens pairs through latent semantic analysis at the document level, followed by the application of clustering techniques and methods of relating clusters to namespaces inherited from software engineering Schubotz et al. (2016a). These cluster-based namespaces are later used for classification Schubotz et al. (2017) rather than ranking, but do not positively impact SVM model performance, despite previous evidence suggesting they resolve co-references Duval et al. (2002), such as a single identifier that “is energy” in one context and “is expectation value” in another. Neither clustering nor namespaces have been further explored in this context. While a more recent model learns context-specific word representations after feeding less specific pretrained word2vec Mikolov et al. (2013); Stathopoulos and Teufel (2016) embeddings to a bidirectional LSTM for classification Stathopoulos et al. (2018), the most recent work predictably relies on more sophisticated pretrained BERT embeddings Devlin et al. (2018) for the language modelling of mathematical notation Jo et al. (2021).
Identifier-definition extraction limitations. Methods considering the specific link between identifiers and their definitions have split off into at least three recent tasks: identifier-definition extraction Schubotz et al. (2017); Alexeeva et al. (2020), variable typing Stathopoulos et al. (2018), and notation auto-suggestion Jo et al. (2021). A lack of consensus on the framing of the task and data prevents a direct comparison between methods. Schubotz et al. (2017) advise against using their gold standard data for training due to certain extractions being too difficult for automated systems, among other reasons.
They also propose that future research should focus on recall, since current methods extract exact definitions for only 1/3 of identifiers, and suggest the use of multilingual semantic role labelling Akbik et al. (2016) and logical deduction Schubotz et al. (2016b). Logical deduction is partially tackled by Alexeeva et al. (2020), whose approach is based on an open-domain causal IE system Sharp et al. (2019) with Odin grammar Valenzuela-Escárcega et al. (2016), where temporal logic is used to obtain intervals referred to by pre-identified time expressions Sharp et al. (2019). We assume the issues with superscript identifiers (such as Einstein notation) from Schubotz et al. (2016b) carry over into Schubotz et al. (2017). The rule-based approach proposed by Alexeeva et al. (2020) attempts to account for such notation (known as wildcards in formula retrieval). They propose that future methods should combine grammar with a learning framework, extend rule sets to account for coordinate constructions, and create well-annotated training data using tools such as PDFAlign and others Asakura et al. (2021).
2.2 Formula Retrieval
We discuss approaches related to the NTCIR-11/12 MathIR Wikipedia Formula Retrieval/Browsing Tasks Zanibbi et al. (2016a). Similar to NTCIR-11, the NTCIR-12 MathIR Task objective is to build math information retrieval (MIR) systems that enable users to search for a particular math concept using math formulae. Given a query which contains a target formula expressed in MathML and several related keywords, each participating system in this task is expected to return a ranked list of the relevant retrieval units containing formulae matching the query Kristianto et al. (2016).
Combining formula tree representations improves retrieval. Two main tree representations of formulae exist: Symbol Layout Trees (SLTs) and Operator Trees (OPTs), shown in Figure 4.
Approaches reliant solely on SLTs, such as the early versions of the Tangent retrieval system Pattaniyil and Zanibbi (2014); Zanibbi et al. (2015, 2016b), or solely on OPTs Zhong and Zanibbi (2019); Zhong et al. (2020) tend to return less relevant formulae from queries. OPTs capture formula semantics while SLTs capture visual structure Mansouri et al. (2019). Effective representation of both formula layout and semantics within a single vector allows a model to exploit both representations. Tangent-S Davila and Zanibbi (2017) was the first evolution of the Tangent system to outperform the NTCIR-11 Aizawa et al. (2014) overall best performer, MCAT Kristianto et al. (2014b, 2016), which encoded path and sibling information from MathML Presentation (SLT-based) and Content (OPT-based). Tangent-S jointly integrated SLTs and OPTs by combining scores for each representation through a simple linear regressor. Later, Tangent-CFT Mansouri et al. (2019) considered SLTs and OPTs through a fastText Bojanowski et al. (2017) n-gram embedding model using tree tuples. MathBERT Peng et al. (2021) does not explicitly account for SLTs. The authors claim that LaTeX codes account for SLTs to some extent and therefore focus on encoding OPTs. They pretrain the BERT Devlin et al. (2018) model with targeted objectives, each accounting for different aspects of mathematical text. They account for OPTs by concatenating node sequences to formula + context BERT input sequences, and by formulating OPT-based structure-aware pretraining tasks learned in conjunction with masked language modelling (MLM).
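As a rough illustration of how a formula tree can be flattened into tuples that a retrieval or embedding model can consume, the sketch below enumerates (ancestor, descendant, path) tuples over a small operator tree for x + y * 2. The `Node` class, the `symbol_pair_tuples` function, and the use of child indices as edge labels are simplifying assumptions made for this example. The exact tuple encodings used by the Tangent systems differ in detail.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """A node of a toy operator tree (hypothetical structure for illustration)."""
    label: str
    children: List["Node"] = field(default_factory=list)


def symbol_pair_tuples(root: Node) -> List[Tuple[str, str, str]]:
    """Return (ancestor, descendant, path) for every ancestor-descendant pair.

    The path is the sequence of child indices followed from the ancestor
    down to the descendant, written as a string.
    """
    tuples: List[Tuple[str, str, str]] = []

    def visit(node: Node, ancestors: List[Tuple[str, str]]) -> None:
        # Emit one tuple per ancestor of the current node
        for ancestor_label, path in ancestors:
            tuples.append((ancestor_label, node.label, path))
        for index, child in enumerate(node.children):
            # Extend every existing path by this edge, and start a new
            # path from the current node to its child
            extended = [(a, p + str(index)) for a, p in ancestors]
            extended.append((node.label, str(index)))
            visit(child, extended)

    visit(root, [])
    return tuples


# Operator tree for x + y * 2
tree = Node("+", [Node("x"), Node("*", [Node("y"), Node("2")])])
for t in symbol_pair_tuples(tree):
    print(t)
```

Treating each tuple as a "word" is what allows standard text machinery, such as TF-IDF weighting or n-gram embeddings, to be applied to formulae.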
Leaf-root path tuples deliver an effective mechanism for embedding relations between symbol pairs. Leaf-root path tuples are now ubiquitous in formula retrieval Zanibbi et al. (2015, 2016b); Davila and Zanibbi (2017); Zhong and Zanibbi (2019); Mansouri et al. (2019); Zhong et al. (2020) and their use for NTCIR-11/12 retrieval has varied since their conception Stalnaker and Zanibbi (2015). Initially, pair tuples were used within a TF-IDF weighting scheme Pattaniyil and Zanibbi (2014), then Zanibbi et al. (2015, 2016b) proposed an appearance-based similarity metric using SLTs, maximum subtree similarity (MSS). OPT tuples were integrated later on Davila and Zanibbi (2017). Mansouri et al. (2019) treat tree tuples as words, extract n-grams, and learn fastText Bojanowski et al. (2017) formula embeddings. Zhong and Zanibbi (2019); Zhong et al. (2020) forgo machine learning altogether with an OPT-based heuristic search (Approach0) through a generalisation of MSS Zanibbi et al. (2016b). Leaf-root path tuples effectively map symbol-pair relations and account for formula substructure, but there is dispute on how best to integrate them into existing machine learning or explicit retrieval frameworks. There exists a contest between well-developed similarity heuristics Zhong and Zanibbi (2019) and embedding techniques Mansouri et al. (2019), despite their complementarity.
Purely explicit methods still deliver competitive results. Tangent-CFT Mansouri et al. (2019) and MathBERT Peng et al. (2021) are two models to employ learning techniques beyond the level of linear regression. Each model is integrated with Approach0 Zhong and Zanibbi (2019) through the linear combination of individual model scores. This respectively forms the TanApp and MathApp baselines, the state of the art in formula retrieval for non-wildcard queries. Approach0 achieves the highest full Bpref score Peng et al. (2021) of the individual models, which highlights the power of explicit methods.
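The linear combination of individual model scores mentioned above can be sketched as follows. The function name `combine_scores`, the single mixing weight, the default score of zero for formulae missing from one model, and the example scores are all assumptions made for illustration. The cited systems may normalise scores or fit weights differently.

```python
from typing import Dict, List


def combine_scores(
    learned: Dict[str, float], explicit: Dict[str, float], weight: float = 0.5
) -> List[str]:
    """Rank formula ids by a weighted sum of two models' similarity scores.

    `weight` is applied to the learned model and `1 - weight` to the
    explicit one. A formula missing from one model is scored 0.0 there
    (an assumption made for this sketch).
    """
    formula_ids = set(learned) | set(explicit)
    combined = {
        fid: weight * learned.get(fid, 0.0) + (1.0 - weight) * explicit.get(fid, 0.0)
        for fid in formula_ids
    }
    return sorted(combined, key=combined.get, reverse=True)


# Hypothetical similarity scores for three candidate formulae
embedding_scores = {"f1": 0.9, "f2": 0.4, "f3": 0.1}
structural_scores = {"f1": 0.2, "f2": 0.8, "f3": 0.3}

print(combine_scores(embedding_scores, structural_scores, weight=0.5))
print(combine_scores(embedding_scores, structural_scores, weight=0.9))
```

Changing the weight changes which model dominates the final ranking, which is why the two kinds of method can complement each other when their errors differ.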
Formula retrieval limitations. Zhong and Zanibbi (2019) propose supporting query expansion of math synonyms to improve recall, and note that Approach0 does not support wildcard queries. Zhong et al. (2020) later provide basic support for wildcards. Tangent-CFT is also not evaluated on wildcard queries, and the authors suggest extending the test selection to include more diverse formulae, particularly those that are not present as exact matches. They propose integrating nearby text into learned embeddings. MathBERT Peng et al. (2021) performs such integration, but does not learn n-gram embeddings. MathBERT is evaluated on non-wildcard queries only.
2.3 Informal Premise Selection
Formal and informal premise selection both involve the selection of relevant statements that are useful for proving a given conjecture Irving et al. (2016); Wang et al. (2017); Ferreira and Freitas (2020a). The difference lies in the language from which the premises and related proof elements are composed, and their compatibility with existing Automated Theorem Provers (ATPs). Informal language is not compatible with existing provers without autoformalisation Wang et al. (2020), a current bottleneck Irving et al. (2016). Typically, when reasoning over large formal libraries comprising thousands of premises, the performance of ATPs degrades considerably, while for a given proof only a fraction of the premises are required to complete it Urban et al. (2010); Alama et al. (2014). Theorem proving is essentially a search problem with a combinatorial search space, and the goal of formal premise selection is to reduce the space, making theorem proving tractable Wang et al. (2017). While formal premises are written in the languages of formal libraries such as Mizar Rudnicki (1992), informal premises (and theorems) as seen in ProofWiki (https://proofwiki.org/wiki/Main_Page) are written in combinations of natural language and LaTeX Ferreira and Freitas (2020a); Welleck et al. (2021a). Proposed approaches either rank Han et al. (2021) or classify Ferreira and Freitas (2020b, 2021) candidate premises for a given proof, detached from formal libraries and ATPs. Informal premise selection is a recently emerging field. Figure 1 describes it as a mid-spectrum task between retrieval and abstraction. Premise selection models select from existing text without explicitly reasoning beyond it. However, proficient models may be somewhat logical by proxy through the very nature of selecting premises for mathematical reasoning chains.
An example of informal premise selection is expressed through the natural language premise selection task, where, given a new conjecture that requires a mathematical proof and a collection (or knowledge base) of premises, the goal is to retrieve the premises most likely to be useful for proving the conjecture Ferreira and Freitas (2020a, b). This is formulated as a classification problem. Alternatively, Welleck et al. (2021a) propose mathematical reference retrieval as an analogue of premise selection. Given a theorem, the goal is to retrieve the set of references (theorems, lemmas, definitions) that occur in its proof, formulated as a ranking problem (retrieval).
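The ranking formulation can be sketched with a deliberately simple baseline: premises are ranked by bag-of-words cosine similarity to the conjecture, and the top-k selection is evaluated with recall. The premise names, the toy statements, and the bag-of-words similarity are assumptions chosen for illustration. The approaches surveyed here use learned encoders rather than word overlap.

```python
import math
from collections import Counter
from typing import Dict, List


def bag_of_words(text: str) -> Counter:
    """Lowercased whitespace tokenisation into token counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_premises(conjecture: str, premises: Dict[str, str], k: int) -> List[str]:
    """Return the names of the k premises most similar to the conjecture."""
    query = bag_of_words(conjecture)
    ranked = sorted(
        premises,
        key=lambda name: cosine(query, bag_of_words(premises[name])),
        reverse=True,
    )
    return ranked[:k]


def recall_at_k(retrieved: List[str], relevant: List[str]) -> float:
    """Fraction of relevant premises that appear in the retrieved list."""
    return len(set(retrieved) & set(relevant)) / len(relevant)


# Hypothetical knowledge base of premises
premises = {
    "even_def": "an integer n is even if n = 2k for some integer k",
    "prime_def": "a prime is a natural number greater than 1 with exactly two divisors",
    "sum_closed": "the sum of two integers is an integer",
}
top = select_premises("the sum of two even integers is even", premises, k=2)
print(top, recall_at_k(top, ["even_def", "sum_closed"]))
```

The classification formulation differs only in the final step: instead of keeping the top k, each conjecture-premise pair receives a binary relevant or not-relevant decision.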
Separate mechanisms for representing mathematics and natural language can improve performance. Regardless of the task variation, current approaches Ferreira and Freitas (2020b); Welleck et al. (2021a); Han et al. (2021); Coavoux and Cohen (2021) tend to jointly consider mathematics and language as a whole, not specifically accounting for aspects of each modality. Leading approaches for formula retrieval Peng et al. (2021); Mansouri et al. (2019) or solving math word problems Kim et al. (2020); Liang et al. (2021); Zhang et al. (2022) do not follow this trend. Ferreira and Freitas (2020b) extract a dependency graph representing dual-modality mathematical statements as nodes, and formulate the problem as link prediction Zhang and Chen (2018) similar to variable typing Stathopoulos et al. (2018). Other transformer-based or self-attentive baselines Ferreira and Freitas (2020b); Welleck et al. (2021a); Han et al. (2021); Coavoux and Cohen (2021) also do not separate mathematical elements from natural language. They consider notation with the same depth as word-level tokens and encode them similarly. Research in neuroscience Butterworth (2002); Amalric and Dehaene (2016) suggests the brain handles mathematics separately to natural language: approaches in premise selection Ferreira and Freitas (2021) and other tasks Peng et al. (2021); Zhang et al. (2022) have prospered from encoding mathematics through a separate mechanism to that of natural language. Ferreira and Freitas (2021) purposefully separate the two modalities, encoding each using self-attention and combining them with a bidirectional LSTM. Explicit disentanglement of the modalities forces the model to exploit latent relationships between language and mathematics through the LSTM layer.
Informal premise selection limitations. Limitations involve a lack of structural consideration of formulae and limited variable typing capabilities. Ferreira and Freitas (2020b) note that the graph-based approach to premise selection as link prediction struggles to encode mathematical statements which are mostly formulae, and suggest inclusion of structural embeddings (e.g. MathBERT Peng et al. (2021)) and training BERT on a mathematical corpus. They also describe value in formulating sophisticated heuristics for navigating the premises graph. Later, following a Siamese network architecture Ferreira and Freitas (2021) reliant on dual-layer word/expression self-attention and a BiLSTM (STAR), the authors demonstrate that STAR does not appropriately encode the semantics of variables. They suggest that variable typing and representation are a fundamental component of encoding mathematical statements. Han et al. (2021) plan to explore the effect of varying pretraining components, testing zero-shot performance without contrastive fine-tuning, and unsupervised retrieval. Coavoux and Cohen (2021) propose a statement-proof matching task akin to informal premise selection, with a solution reliant on a self-attentive encoder and bilinear similarity function. The authors note model confusion due to the proofs introducing new concepts and variables rather than referring to existing concepts.
2.4 Math Word Problems
Solving math word problems (MWPs) dates back to the dawn of artificial intelligence Feigenbaum et al. (1963); Bobrow (1964); Charniak (1969). It can be defined as the task of translating a paragraph into a set of equations to be solved Li et al. (2020). We focus on trends in the task since 2019, as a detailed survey Zhang et al. (2019) captures prior work.
The use of dependency graphs is instrumental in supporting inference. In graph-based approaches to solving MWPs, embeddings of words, numbers, or relationship graph nodes are learned through graph encoders which feed information through to tree (or sequence) decoders. Embeddings are decoded into expression trees which determine the problem solution. Li et al. (2020) learn the mapping between a heterogeneous graph representing the input problem, and an output tree. The graph is constructed from word nodes with relationship nodes of a parsing tree. This is either a dependency parse tree or constituency tree. Zhang et al. (2020) represent two separate graphs: a quantity cell graph associating descriptive words with problem quantities, and a quantity comparison graph which retains numerical qualities of the quantity, and leverages heuristics to represent relationships between quantities such that solution expressions reflect a more realistic arithmetic order. Shen and Jin (2020) also extract two graphs: a dependency parse tree and numerical comparison graph. Zhang et al. (2022) construct a heterogeneous graph from three subgraphs: a word-word graph containing syntactic and semantic relationships between words, a number-word graph, and a number comparison graph. Their model is the best performing graph-based approach to date. Although other important differences exist (such as decoder choice), it seems that models explicitly relating multiple linguistic aspects of problem text tend to deliver better problem solving.
Multi-encoders and multi-decoders improve performance by combining complementary representations. Another impactful architectural decision is the choice of encoder/decoder. To highlight this, we consider the following comparison. Shen and Jin (2020) and Zhang et al. (2020) each extract two graphs from the problem text. One is a number comparison graph, and the other relates word-word pairs Shen and Jin (2020) or word-number pairs Zhang et al. (2020). They both encode two graphs rather than one heterogeneous graph Li et al. (2020); Zhang et al. (2022). They both use a similar tree-based decoder Xie and Sun (2019). A key difference is that Shen and Jin (2020) include an additional sequence-based encoder and decoder. The sequence-based encoder first obtains a textual representation of the input paragraph, then the graph-based encoder integrates the two encoded graphs. Then tree-based and sequence-based decoders generate different equation expressions for the problem, with an additional mechanism for optimising solution expression selection. In their own work, Shen and Jin (2020) demonstrate the impact of multi-encoders/decoders over each encoder/decoder option individually through ablation.
Goal-driven decompositional tree-based decoders are a significant component in the state of the art. Introduced in Xie and Sun (2019), this class of decoder is considered by all but three discussed models, as shown in Figure 5, and extends to non-graph-based models Qin et al. (2021); Liang et al. (2021). In GTS, goal vectors guide construction of expression subtrees (from token node embeddings) in a recursive manner, until a solution expression tree is generated. Proposed models do expand on the GTS-based decoder through the inclusion of semantically-aligned universal expression trees Qin et al. (2020, 2021), though this adaptation is not as widely used. The state-of-the-art approaches Liang et al. (2021); Zhang et al. (2022) follow the GTS decoder closely.
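To illustrate how a generated expression tree determines the problem solution, the sketch below evaluates a token sequence that encodes an expression tree in prefix (pre-order) form. The assumption that the decoder output is available as a prefix token sequence, the operator set, and the function name `evaluate_prefix` are ours and made for illustration. The neural goal vectors that drive generation in GTS are not modelled here.

```python
from typing import Callable, Dict, List, Tuple

# Binary operators supported by this toy evaluator (an illustrative choice)
OPERATORS: Dict[str, Callable[[float, float], float]] = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}


def evaluate_prefix(tokens: List[str]) -> float:
    """Evaluate an expression tree given as a pre-order token sequence.

    Each operator is followed by its left subtree and then its right
    subtree, so the tree is rebuilt by recursive decomposition.
    """

    def build(position: int) -> Tuple[float, int]:
        token = tokens[position]
        if token in OPERATORS:
            # Decompose into a left and a right sub-expression
            left, position = build(position + 1)
            right, position = build(position)
            return OPERATORS[token](left, right), position
        # Leaf node: a quantity from the problem text or a constant
        return float(token), position + 1

    value, end = build(0)
    if end != len(tokens):
        raise ValueError("trailing tokens after a complete expression tree")
    return value


# "Tom has 3 apples and buys 5 more, then doubles them": (3 + 5) * 2
print(evaluate_prefix(["*", "+", "3", "5", "2"]))
```

Because every operator has exactly two children, a well-formed prefix sequence corresponds to exactly one tree, which is one reason tree-structured decoding avoids the invalid expressions a free-form sequence decoder can produce.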
Language models that transfer knowledge learned from auxiliary tasks rival models based on explicit graph representation of problem text. As an alternative to encoding explicit relations through graphs, other work Kim et al. (2020); Qin et al. (2021); Liang et al. (2021) relies on pretrained transformer-based models, and those which incorporate auxiliary tasks assumed relevant for solving MWPs to latently learn such relations. However, it seems that auxiliary tasks alone do not deliver competitive performance Qin et al. (2020) without extensive pretraining on large corpora, as we see with BERT-based transformer models. These use either both the (ALBERT Lan et al. (2019)) encoder and decoder Kim et al. (2020), or a BERT-based encoder with a goal-driven tree-based decoder Liang et al. (2021).
Math word problem limitations. For Graph2Tree-Z, Zhang et al. (2020) suggest considering more complex relations between quantities and language, and introducing heuristics to improve solution expression generation from the tree-based decoder. For EPT, Kim et al. (2020) find that the error probability related to fragmentation issues increases exponentially with the number of unknowns, and propose generalising EPT to other MWP datasets. For HGEN, Zhang et al. (2022) note three areas of future improvement: combining models into a unified framework through ensembling multiple encoders (similar to Ferreira and Freitas (2021)); integrating external knowledge sources (e.g. HowNet Dong and Dong (2003), Cilin Hong-Minh and Smith (2008)); and real-world dataset development for unsupervised or weakly supervised approaches Qin et al. (2020).
2.5 Informal Theorem Proving
Formal automated theorem proving in logic is among the most advanced and abstract forms of reasoning materialised in the AI space. There are two major bottlenecks Irving et al. (2016) formal methods must overcome: (1) translating informal mathematical text into formal language (autoformalisation), and (2) a lack of strong automated reasoning methods to fill in the gaps in already formalised human-written proofs. Informal methods either tackle autoformalisation directly Wang et al. (2020); Wu et al. (2022), or circumvent it through language modelling-based proof generation Welleck et al. (2021a, b), trading formal rigour for flexibility. Transformer-based models have been proposed for mathematical reasoning Polu and Sutskever (2020); Rabe et al. (2020); Wu et al. (2021). Converting informal mathematical text into forms interpretable by computers Kaliszyk et al. (2015a, b); Szegedy (2020); Wang and Deng (2020); Meadows and Freitas (2021) is closer to the real-world reasoning and communication format followed by mathematicians.
Autoformalisation could be addressed through approximate translation and exploration rather than direct machine translation. A long-studied and extremely challenging endeavour Zinn (1999, 2003), autoformalisation involves converting informal mathematical text into language interpretable by theorem provers Kaliszyk et al. (2015b); Wang et al. (2020); Szegedy (2020). Kaliszyk et al. (2015b) propose a statistical learning approach for parsing ambiguous formulae over the Flyspeck formal mathematical corpus Hales (2006). Later, thanks to improved machine translation capabilities Luong et al. (2017); Lample et al. (2018); Lample and Conneau (2019), Wang et al. (2020) explore dataset translation experiments between LaTeX code extracted from ProofWiki, and formal libraries Mizar Rudnicki (1992) and TPTP Sutcliffe and Suttner (1998). The supervised RNN-based neural machine translation model Luong et al. (2017) outperforms the transformer-based Lample et al. (2018) and MLM pretrained transformer-based Lample and Conneau (2019) models, with the performance boost stemming from its use of alignment data. Szegedy (2020) advises against such direct translation efforts, instead proposing a combination of exploration and approximate translation through predicting formula embeddings. In seq2seq models, embeddings are typically granular, encoding word-level or symbol-level Jo et al. (2021) tokens. The suggestion is to learn mappings from natural language input to premise statements nearby the desired statement in the embedding space, traversing the space between statements using a suitable prover Bansal et al. (2019). Guided mathematical exploration for real-world proofs is still an unaddressed problem and does not scale well with step-distance between current and desired formulae. It may be easier to continue with direct translation Wang et al. (2020). For example, Wu et al. (2022) report promising results, directly autoformalising small competition problems to Isabelle statements using language models. Similar to the previous suggestion Szegedy (2020), they also autoformalise statements as targets for proof search with a neural theorem prover.
Need for developing robust interactive natural language theorem provers. We discuss the closest equivalent to formal theorem proving in an informal setting. Welleck et al. (2021a) propose a mathematical reference generation task. Given a mathematical claim, the order and number of references within a proof are predicted. A reference is a theorem, definition, or a page that is linked to within the contents of a statement or proof. Each theorem has a proof containing a sequence of references, each drawn from the collection of available references.
Where the retrieval task assigns a score to each candidate reference, the generation task produces a variable-length sequence of references with the goal of matching the reference sequence of the ground-truth proof, for which a BERT-based model is employed and fine-tuned on various data sources. Welleck et al. (2021b) expand on their proof generation work, proposing two related tasks: next-step suggestion, where a step from a proof (as described above) is defined as a sequence of tokens to be generated, given the previous steps and the claim; and full-proof generation, which extends this task to generate the full proof. They employ BART Lewis et al. (2019), an encoder-decoder model pretrained with denoising tasks, and augment the model with reference knowledge using Fusion-in-Decoder Izacard and Grave (2020). The intermediate denoising training and knowledge-grounding improve model performance by producing better representations of (denoised) references for deployment at generation time, and by encoding reference-augmented inputs. Aiming towards automatic physics derivation, Meadows and Freitas (2021) propose an equation reconstruction task similar to next-step suggestion, where, given a sequence of LaTeX strings from a computer algebra physics derivation, the intermediate string is removed and must be rederived. The similarity-based heuristic search selects two consecutive computer algebra operations from a knowledge base and sequentially applies them to the string preceding the removed one, in order to derive the known equation that follows it. If that equation is obtained, then the equation after the first operation is taken as the missing intermediate string, and a partial derivation is achieved.
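The equation reconstruction idea can be illustrated with a minimal sketch. It is not the method of Meadows and Freitas (2021): the similarity-based heuristic is swapped for an exhaustive search over ordered pairs of operations, each equation is reduced to a polynomial right-hand side stored as a coefficient tuple, and the three operations in the knowledge base are hypothetical stand-ins for computer algebra operations.

```python
from itertools import product

# Toy stand-in for a computer algebra system (an assumption for illustration):
# a polynomial in x is a tuple of coefficients (c0, c1, c2, ...).

def _trim(coeffs):
    """Drop trailing zero coefficients so equal polynomials compare equal."""
    coeffs = list(coeffs)
    while len(coeffs) > 1 and coeffs[-1] == 0:
        coeffs.pop()
    return tuple(coeffs)

def differentiate(poly):
    """d/dx of the polynomial; a constant differentiates to (0,)."""
    return _trim([i * c for i, c in enumerate(poly)][1:] or [0])

def multiply_by_x(poly):
    """Shift every coefficient up by one power of x."""
    return _trim((0,) + tuple(poly))

def add_one(poly):
    """Add the constant 1."""
    return _trim((poly[0] + 1,) + tuple(poly[1:]))

# Hypothetical knowledge base of operations.
OPERATIONS = {
    "differentiate": differentiate,
    "multiply_by_x": multiply_by_x,
    "add_one": add_one,
}

def reconstruct(previous, following):
    """Search for two consecutive operations taking `previous` to `following`.

    Returns (first_op, second_op, intermediate) for the first pair that works,
    or None when no pair of operations reaches `following`.
    """
    for first, second in product(OPERATIONS, repeat=2):
        intermediate = OPERATIONS[first](previous)
        if OPERATIONS[second](intermediate) == following:
            return first, second, intermediate
    return None

# x^2 -> (removed step) -> 2x^2
print(reconstruct((0, 0, 1), (0, 0, 2)))
```

Here the removed step is recovered as `2x`, the result of the first operation. An exhaustive search grows quadratically with the size of the knowledge base, which is one reason a guiding heuristic matters at realistic scale.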
Informal theorem proving limitations. Wang et al. (2020) suggest the development of high-quality datasets for evaluating translation models, including structural formula representations, and jointly embedding multiple proof assistant libraries to increase formal dataset size. Szegedy (2020) argues that reasoning systems based on self-driven exploration without informal communication capabilities would suffer usage and evaluation difficulties. Wu et al. (2022) note limitations with text window size and difficulty storing large formal theories with current language models. After proposing the NaturalProofs dataset, Welleck et al. (2021a) characterise error types for the full-proof generation and next-step suggestion tasks, noting issues with: (1) hallucinated references, meaning the reference does not occur in NaturalProofs; (2) non-ground-truth references, meaning the reference does not occur in the ground-truth proof; (3) undefined terms; (4) improper or irrelevant statements, meaning a statement that is mathematically invalid or irrelevant to the proof; and (5) statements that do not follow logically from the preceding statements. Dealing with research-level physics, Meadows and Freitas (2021) note that the cost of semi-automated formalisation is significant and does not scale well, requiring detailed expert-level manual intervention. They also call for a set of well-defined computer algebra operations such that robust mathematical exploration can be guided in a goal-based setting.
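The two reference-related error types above are defined by set membership, so they can be checked mechanically. The following is a minimal sketch under that reading; the function name, label strings, and example reference titles are hypothetical, and the remaining error types (undefined terms, invalid or irrelevant statements, non sequiturs) would still need human or model judgement.

```python
def classify_references(generated, corpus, ground_truth):
    """Label each generated reference by error type.

    "hallucinated":     not a reference in the corpus at all
    "non_ground_truth": in the corpus, but absent from the ground-truth proof
    "ok":               appears in the ground-truth proof
    """
    corpus, ground_truth = set(corpus), set(ground_truth)
    labels = {}
    for ref in generated:
        if ref not in corpus:
            labels[ref] = "hallucinated"
        elif ref not in ground_truth:
            labels[ref] = "non_ground_truth"
        else:
            labels[ref] = "ok"
    return labels

# Illustrative reference titles only; not taken from any dataset.
labels = classify_references(
    generated=["Triangle Inequality", "Made-Up Lemma", "Zorn's Lemma"],
    corpus=["Triangle Inequality", "Zorn's Lemma", "Cauchy Sequence"],
    ground_truth=["Triangle Inequality", "Cauchy Sequence"],
)
print(labels)
```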
3 Conclusion
In this work we deliver a synthesis of the recent evolutionary arc for strategic areas in mathematical language processing. We systematically describe the methods, challenges and trends within each area, eliciting consolidated modelling components and emerging methodological advances. In areas related to variable typing and formula retrieval, explicit methods compete with and complement embedding models. In word problem solving involving simpler mathematics, dependency graphs explicitly represent relationships between numerical tokens and language. Models either encode graph input or sequence input, and decode to solution expression trees via recursive goal-driven tree decoders. Research with multi-encoders/decoders suggests value in combining representations. For advanced mathematics, language-based premise selection models also use graph-based and transformer-based models, mostly learning formulae and language embeddings without integrating formula structure or variable typing. Limited autoformalisation of informal mathematics exists through machine translation, but it is elsewhere argued that approximate translation to related premises followed by exploration is more promising. Some circumvent formal libraries altogether through flexible proof generation, physics derivation, and premise selection in less formal environments. We hope future techniques will benefit from this synthesis.
