Math equations are widely used in Science, Technology, Engineering, and Mathematics (STEM). However, students often find it daunting to understand the math content and equations in STEM publications [11, 7]. Thanks to the Web, students can post detailed math questions online for help. Question-answering systems such as Mathematics Stack Exchange (https://math.stackexchange.com) and MathOverflow (https://mathoverflow.net) attempt to address this need. From the questioner's viewpoint, detailed math questions are usually long and complex, so a concise, to-the-point headline helps those who pose them get help efficiently. Correspondingly, those who answer the question (answerers) also need a clear and brief headline to quickly decide whether to respond. Therefore, giving a detailed question a concise math headline is important and meaningful. Figure 1 illustrates an example question along with its headline posted on Mathematics Stack Exchange (https://math.stackexchange.com/questions/3331385). Clearly, a complicated question can make it difficult for answerers to grasp the questioner's intent, while a concise headline can substantially reduce this effort.
To this end, we explore a novel approach for geNerating A Math hEadline for detailed questions (NAME). We define NAME as a summarization task. Compared to conventional summarization tasks, NAME raises two extra essential issues that need to be addressed: 1) Jointly modeling text and math equations in a unified framework. Textual information (words) and mathematical information (equations) usually coexist in both detailed questions and brief headlines, as shown in Figure 1. It is therefore natural and necessary to process text and math equations together [17, 24], yet it is not obvious how to model them in a unified framework. For instance, Yasunaga and Lafferty attempted to utilize both textual and mathematical representations, but treated them as separate components. We argue that this loses crucial information, e.g., the positions of equations and the semantic dependencies between text and equations. 2) Capturing semantic and structural features of math equations synchronously. Unlike text, math equations contain not only semantic features but also structural features: two equations can share the same semantic features while differing structurally. However, most existing research considers only one of these two characteristics. For instance, some work [26, 27] considered only the structural information of equations for mathematical information retrieval, while other work [2, 24] treated a math equation as a sequence of basic symbols and modeled it as text, losing structural features.
To address these issues, we propose MathSum, a novel method that combines a pointer mechanism with multi-head attention for mathematical representation augmentation. The pointer mechanism can copy either textual tokens or math tokens from the source question when generating the math headline. The multi-head attention mechanism enriches the representation of each math equation separately by modeling and integrating both its semantic and structural features. For evaluation, we construct two large datasets, EXEQ-300k and OFEQ-10k, which contain 290,479 and 12,548 detailed questions with corresponding math headlines from Mathematics Stack Exchange and MathOverflow, respectively. We compare our model with several abstractive and extractive baselines. Experimental results demonstrate that our model significantly outperforms several strong baselines on the NAME task.
In summary, the contributions of our work are:
an innovative NAME task for generating a concise math headline for a given detailed math question.
a novel summarization model, MathSum, that addresses the essential issues of the NAME task: textual and mathematical information are jointly modeled in a unified framework, while both semantic and structural features of math equations are captured synchronously.
novel math datasets (https://github.com/yuankepku/MathSum). To the best of our knowledge, these are the first mathematical content/question datasets associated with headline information.
Mathematical Equation Representation
Unlike text, math equations are often highly structured. They not only contain semantic features, but also structural features. Recent work [16, 27, 25, 7] focused mainly on the structural features of math equations, and utilized tree structures to represent equations for mathematical information retrieval and mathematical word problem solving. Other work [5, 8, 24] instead focused mainly on the semantic features of equations. They processed an equation as a sequence of symbols in order to learn its representation.
Mathematical Equation Generation
Similar to text generation, math equation generation has been widely explored. Recent work [2, 29, 9] utilized end-to-end frameworks to generate equations from mathematical images, e.g., handwritten math equations. Other work [16, 22] inferred math equations for word problem solving, but supported only a limited set of operators. The work most related to ours created a model to generate equations given specific topics (e.g., electric field). Our task (NAME), instead, aims at generating math headlines from both equations and text without a given topic. NAME is thus quite challenging, since it requires models to generate correct equations in the correct positions within the generated headlines.
Table 1. Statistics of the datasets (columns: Datasets, avg. math num, avg. text tokens, avg. math tokens, avg. sent. num, text vocab. size, math vocab. size).
Table 2. Number of collected question pairs and correct question pairs per dataset (columns: datasets, question pairs, correct question pairs).
Summarization and Headline Generation
Summarization, a fundamental task in Natural Language Processing (NLP), can be broadly categorized into extractive and abstractive methods. Extractive methods [12, 14] select sentences from the original document to form the summary. Abstractive methods [18, 19, 13, 6] aim at generating the summary based on an understanding of the document.
We view headline generation as a special type of summarization, with the constraint that only a short sequence of words is generated and that it preserves the essential meaning of a math question document. Recently, headline generation methods with end-to-end frameworks [20, 13, 28, 6] have achieved significant success. Math headline generation is similar to existing headline generation tasks but differs in several aspects. The major difference is that a math headline consists of both text and math equations, which requires jointly modeling and inferring over text and math equations.
Task and Dataset
We define the NAME task as a summarization task. Let $X = (x_1, x_2, \dots, x_n)$ denote the token sequence of the input detailed question, where $n$ is the number of tokens in the source, $x_i^{t}$ represents a textual token (word), and $x_i^{m}$ represents a math token (a math token is the fundamental element from which a math equation is formed). For each input $X$, there is a corresponding output math headline $Y = (y_1, y_2, \dots, y_l)$ with $l$ tokens, where $y_j^{t}$ and $y_j^{m}$ are textual tokens and math tokens, respectively. The goal of NAME is to generate a math headline from the input question, i.e., to learn the mapping $X \rightarrow Y$.
Since this NAME task is new, we could find no public benchmark dataset. As such, we build two real-world math datasets, EXEQ-300k (from Mathematics Stack Exchange) and OFEQ-10k (from MathOverflow), for model training and evaluation. Both datasets consist of detailed questions with corresponding math headlines.
In EXEQ-300k and OFEQ-10k, each question is a detailed math question, and the corresponding headline is a human-written question summary with math equations, typically written by the questioner. In Mathematics Stack Exchange and MathOverflow, math equations are enclosed in "$$" symbols. In our datasets we replace "$$" with "<m>" and "</m>" markers to indicate the beginning and end of an equation. In addition, the Stanford CoreNLP toolkit (https://stanfordnlp.github.io/CoreNLP/) and the LaTeX tokenizer in im2markup (https://github.com/harvardnlp/im2markup) are used to tokenize the text and the equations, respectively, in questions and headlines.
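The delimiter replacement and tokenization step can be sketched as follows. This is a toy illustration: the `<m>`/`</m>` markers mirror the dataset convention, but the crude LaTeX splitter here stands in for the actual im2markup tokenizer.

```python
import re

def tokenize_question(s):
    """Split a question into textual tokens and delimited math tokens.

    Math spans are assumed to be marked "<m> ... </m>" (the datasets
    replace the original "$$" delimiters with begin/end markers). The
    math span is tokenized by a toy splitter, not the im2markup one.
    """
    tokens = []
    # re.split with a capture group alternates text and math segments.
    parts = re.split(r"<m>(.*?)</m>", s)
    for i, part in enumerate(parts):
        if i % 2 == 0:                       # text segment
            tokens.extend(part.split())
        else:                                # math segment
            tokens.append("<m>")
            # crude LaTeX tokenization: commands, structure chars, symbols
            tokens.extend(re.findall(r"\\[A-Za-z]+|[{}^_]|[^\s{}^_\\]", part))
            tokens.append("</m>")
    return tokens
```

A question such as `"find <m>x^2</m> roots"` then yields a mixed sequence of textual and math tokens with explicit equation boundaries.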
Specifically, we collect 346,202 (detailed question, math headline) pairs from Mathematics Stack Exchange and 13,408 pairs from MathOverflow. To aid analysis and ensure quality, we remove pairs containing math equations that cannot be tokenized by the LaTeX tokenizer. This leaves 290,479 pairs from Mathematics Stack Exchange, which form EXEQ-300k, and 12,548 pairs from MathOverflow, which form OFEQ-10k. See Table 1 and Table 2 for details. In EXEQ-300k, there are on average 6.08 math equations per question and 1.72 per headline. In contrast, OFEQ-10k contains more math equations per question (8.56) and fewer per headline (1.41). In EXEQ-300k, questions have on average 60.65 textual tokens and 12.27 math tokens, while headlines have on average 7.72 textual tokens and 9.91 math tokens. Correspondingly, in OFEQ-10k, there are on average 105.92 textual tokens and 10.04 math tokens per question, and on average 10.04 textual tokens and 6.84 math tokens per headline. Compared to EXEQ-300k, OFEQ-10k thus contains more tokens (textual and math) in questions and fewer in headlines. From Figure 2, we also see that OFEQ-10k has a higher proportion of novel n-grams than EXEQ-300k. Based on these observations, we believe the two datasets are significantly different and mutually complementary.
Here we describe our proposed deep model, MathSum, which we designed for the NAME task.
As shown in Figure 3, MathSum utilizes a pointer mechanism together with a multi-head attention mechanism for mathematical representation augmentation. It consists of two main components: (1) an encoder that jointly learns the representation of math equations and text, and (2) a decoder that learns to generate headlines from the learned representation.
For the encoder, the crucial issue is to build effective representations for the tokens of an input question. As described in the NAME task definition, there are two different token types (textual and math) whose characteristics are intrinsically different. Math tokens contain not only semantic features (mathematical meaning) but also structural features (e.g., superscripts/subscripts, numerators/denominators, recursive structure). Therefore, representation learning should vary according to the token type. In this study, we introduce a multi-head attention mechanism to enrich the representation of math tokens.
Each token $x_i$ of the input question is first converted into a continuous vector representation $e_i$, so that the vector representation of the input is $E = (e_1, e_2, \dots, e_n)$, where $n$ is the number of tokens in the input and $e_i^{t}$, $e_i^{m}$ are the vector representations of textual and math tokens, respectively. The vectors of the math tokens within an equation are then fed into a block with multi-head attention, which enriches their representations by considering both semantic and structural features. Note that each equation in the input is fed into the block separately, since an equation is the fundamental unit for characterizing the semantic and structural features of a series of math tokens. Let $E_k = (e_p, \dots, e_q)$ denote the initial vector representation of the $k$-th math equation with $q - p + 1$ math tokens as input. The multi-head attention block then transforms $E_k$ into its enriched representation $\tilde{E}_k$, calculated by

$$\tilde{E}_k = \mathrm{MultiHead}(E_k),$$

where $\mathrm{MultiHead}(\cdot)$ is the multi-head attention block, $p$ is the beginning index of math equation $k$, and $q$ is the end index.
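As a rough illustration of this enrichment step, the following NumPy sketch applies multi-head self-attention to the vectors of one equation's math tokens. The random projections and the omission of the residual and feed-forward sublayers are simplifications for illustration, not the exact block used in MathSum.

```python
import numpy as np

def multi_head_self_attention(E, num_heads=4, seed=0):
    """Enrich math-token vectors E of shape (len_k, d) via multi-head
    self-attention. Projections are randomly initialised here; in a real
    model they are learned parameters."""
    L, d = E.shape
    assert d % num_heads == 0
    dh = d // num_heads
    rng = np.random.default_rng(seed)
    Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))
    Q, K, V = E @ Wq, E @ Wk, E @ Wv
    heads = []
    for h in range(num_heads):
        q = Q[:, h*dh:(h+1)*dh]
        k = K[:, h*dh:(h+1)*dh]
        v = V[:, h*dh:(h+1)*dh]
        scores = q @ k.T / np.sqrt(dh)               # (L, L) token-token affinities
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)           # softmax over key positions
        heads.append(w @ v)                          # (L, dh) per-head output
    return np.concatenate(heads, axis=-1) @ Wo      # (L, d) enriched vectors
```

Because every math token attends to every other token of the same equation, structural relations such as superscript or denominator context can influence each token's enriched vector.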
After that, the enriched vector representation of the input is $\tilde{E} = (\tilde{e}_1, \dots, \tilde{e}_n)$, where textual-token vectors are kept as-is and math-token vectors are replaced by their enriched versions; these vectors are fed one-by-one into the update layer (a single-layer bidirectional LSTM). The hidden state is updated according to the previous hidden state and the current token vector,

$$h_i = \mathrm{LSTM}(h_{i-1}, \tilde{e}_i),$$

where $\mathrm{LSTM}(\cdot)$ is the dynamic function of the LSTM unit and $h_i$ is the hidden state for the token at step $i$.
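The recurrence can be sketched as a single LSTM step. This is a single-direction sketch (MathSum runs the recurrence in both directions as a bidirectional LSTM), with the four gate parameters stacked into one weight matrix; the layout is an assumption for compactness.

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One step of the encoder update layer: h_i = f(h_{i-1}, e_i).

    W: (4*dh, dx), U: (4*dh, dh), b: (4*dh,) hold the input, forget,
    output, and candidate gates stacked in that order.
    """
    dh = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))
    i, f, o = sig(z[:dh]), sig(z[dh:2*dh]), sig(z[2*dh:3*dh])
    g = np.tanh(z[3*dh:])
    c = f * c_prev + i * g        # new cell state
    h = o * np.tanh(c)            # new hidden state
    return h, c
```

Iterating this step left-to-right and right-to-left and concatenating the two hidden states yields the bidirectional encoder states.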
In the decoder, we aggregate the encoder hidden states using a weighted sum, which becomes the context vector $c_t$:

$$u_i^t = v^{\top} \tanh(W_h h_i + W_s s_t + b), \quad a^t = \mathrm{softmax}(u^t), \quad c_t = \sum_i a_i^t h_i,$$

where $v$, $W_h$, $W_s$, and $b$ are learnable parameters, $s_t$ is the hidden state of the decoder at time step $t$, and the attention $a^t$ is a distribution over the input positions.
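The attention distribution and context vector can be sketched with the additive (Bahdanau-style) scoring common in pointer-generator decoders; the weight shapes here are illustrative assumptions.

```python
import numpy as np

def attention_context(H, s_t, v, W_h, W_s, b):
    """Compute the attention distribution over encoder states H (n, dh)
    given decoder state s_t, and the resulting context vector c_t."""
    scores = np.tanh(H @ W_h.T + s_t @ W_s.T + b) @ v   # (n,) unnormalised
    a = np.exp(scores - scores.max())
    a /= a.sum()                                        # softmax over positions
    c_t = a @ H                                         # (dh,) weighted sum
    return a, c_t
```

The distribution `a` doubles as the copy distribution over source positions when a pointer mechanism is used.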
At this point, the generated math headline may contain textual or math tokens from the source that are out-of-vocabulary. We therefore utilize a pointer network to copy tokens directly from the source. Since a token may be copied from the source or generated from the vocabulary, we use a copy probability $p_{\mathrm{copy}}$ as a soft switch between copying tokens from the input and generating textual tokens from the vocabulary:

$$p_{\mathrm{copy}} = \sigma(w_c^{\top} c_t + w_s^{\top} s_t + w_x^{\top} x_t + b_{\mathrm{copy}}),$$

where $\sigma$ is a non-linear (sigmoid) function, $w_c$, $w_s$, $w_x$, and $b_{\mathrm{copy}}$ are learnable parameters, and $x_t$ is the decoder input at time step $t$.
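The soft switch can be sketched as mixing the copy distribution (attention over source positions) with the generation distribution over the vocabulary. Mapping each source position to a vocabulary id (with reserved ids for out-of-vocabulary source tokens) is an implementation assumption, not a detail given in the paper.

```python
import numpy as np

def final_distribution(p_copy, attn, src_ids, vocab_dist):
    """Mix copying and generation: out-of-vocabulary source tokens can
    still be produced because attention mass is scattered onto their ids.

    attn: (n_src,) attention over source positions (sums to 1)
    src_ids: (n_src,) vocabulary id of each source token
    vocab_dist: (V,) generation distribution (sums to 1)
    """
    out = (1.0 - p_copy) * vocab_dist.copy()
    # scatter-add attention mass onto the ids of the source tokens
    np.add.at(out, src_ids, p_copy * attn)
    return out
```

Since both input distributions sum to one, the mixture is itself a valid distribution for any switch value in [0, 1].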
Finally, the training loss at time step $t$ is defined as the negative log-likelihood of the target token $y_t^{*}$:

$$\mathrm{loss}_t = -\log P(y_t^{*}).$$
Comparison of Methods
We compare our model with baseline methods on both EXEQ-300k and OFEQ-10k for the NAME task. Four extractive methods are implemented as baselines: Random randomly selects a sentence from the input question; Lead simply selects the leading sentence of the input question, while Tail selects the last sentence; and TextRank (we use the summanlp implementation, https://github.com/summanlp/textrank) extracts sentences from the text according to scores computed by an algorithm similar to PageRank. In addition, three abstractive methods (using the OpenNMT implementation, https://github.com/OpenNMT/OpenNMT-py) are compared against MathSum. Seq2Seq is a sequence-to-sequence model based on LSTM units and an attention mechanism. PtGen is a pointer network that allows copying tokens from the source. Transformer is a neural network model based on a multi-head attention mechanism.
We randomly split EXEQ-300k into training (90%, 261,341), validation (5%, 14,564), and test (5%, 14,574) sets. To obtain enough test samples, we split OFEQ-10k into 80% training (10,301), 10% validation (1,124), and 10% test (1,123) portions. For a fair comparison, all models use the same experimental data setup. For EXEQ-300k, all models are trained and tested on the same dataset. For OFEQ-10k, to achieve better experimental results, all models are first trained on the EXEQ-300k training set and then fine-tuned and tested on OFEQ-10k.
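The random split can be sketched as follows; the ratios default to the 90/5/5 split used for EXEQ-300k, and the seed is an arbitrary choice for reproducibility.

```python
import random

def split_pairs(pairs, ratios=(0.9, 0.05, 0.05), seed=42):
    """Shuffle (question, headline) pairs and split them into
    train/validation/test sets with the given ratios
    (OFEQ-10k would use ratios=(0.8, 0.1, 0.1) instead)."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n = len(pairs)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    return pairs[:n_train], pairs[n_train:n_train + n_val], pairs[n_train + n_val:]
```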
For our experiments, the dimensionality of the word embeddings is 128 and the hidden-state size of the LSTM units for both encoder and decoder is 512. The multi-head attention block contains 4 heads and 256-dimensional hidden states for the feed-forward part. The model is trained using AdaGrad with a learning rate of 0.2, an initial accumulator value of 0.1, and a batch size of 16. We set the dropout rate to 0.3. The vocabulary size for both questions and headlines is 50,000. In addition, the encoder and decoder share token representations. At test time, we decode the math headline using beam search with a beam size of 3, and set the minimum length to 20 tokens on EXEQ-300k and 15 tokens on OFEQ-10k. We implement our model in PyTorch and train on a single Titan X GPU.
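Decoding with beam search under a minimum-length constraint can be sketched as below. Here `step_fn`, the toy probabilities, and the `</s>` end-of-sequence marker are hypothetical stand-ins for the decoder's next-token distribution; real decoding scores log-probabilities from the trained model.

```python
import math

def beam_search(step_fn, start_token, beam_size=3, max_len=20, min_len=15, eos="</s>"):
    """Minimal beam search sketch. step_fn(prefix) returns a list of
    (token, prob) continuations. Hypotheses may not emit EOS before
    reaching min_len, mirroring the minimum-length constraint."""
    beams = [([start_token], 0.0)]            # (prefix, log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, p in step_fn(prefix):
                if tok == eos and len(prefix) < min_len:
                    continue                   # enforce minimum headline length
                candidates.append((prefix + [tok], score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            (finished if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:
            break
    best = max(finished + beams, key=lambda c: c[1])
    return best[0]
```

A length-normalised score is often preferred in practice so that longer hypotheses are not unduly penalised; the plain sum of log-probabilities is used here for brevity.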
Table 4. Comparison of different models on the EXEQ-300k and OFEQ-10k test sets according to math evaluation metrics (columns, per dataset: Edit Distance(m), Edit Distance(s), Exact Match). Edit Distance(m) and Edit Distance(s) measure dissimilarity (smaller is better); Exact Match is the number of math tokens accurately generated in the math headlines (larger is better).
Here we use three standard metrics for evaluation: ROUGE, BLEU, and METEOR. The ROUGE metric measures summary quality by counting the overlapping units (e.g., n-grams) between the generated summary and reference summaries. We report the F1 scores for R1 (ROUGE-1), R2 (ROUGE-2), and RL (ROUGE-L). The BLEU score, widely used as an accuracy measure for machine translation, computes the n-gram precision of a candidate sequence against the reference. METEOR is recall-oriented and evaluates hypotheses by aligning them to reference translations and calculating sentence-level similarity scores. The BLEU and METEOR scores are calculated using the nlg-eval package (https://github.com/Maluuba/nlg-eval), and the ROUGE scores using the rouge-baselines package (https://github.com/sebastianGehrmann/rouge-baselines).
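To make the overlap counting concrete, a toy ROUGE-N F1 can be sketched as below; the reported numbers come from the rouge-baselines package, so this is only an illustration of the underlying computation.

```python
from collections import Counter

def rouge_n_f1(candidate, reference, n=1):
    """Toy ROUGE-N F1: clipped n-gram overlap between a generated
    headline and the reference, combined into an F1 score."""
    grams = lambda toks: Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    c, r = grams(candidate), grams(reference)
    if not c or not r:
        return 0.0
    overlap = sum((c & r).values())          # clipped overlap counts
    p, rec = overlap / sum(c.values()), overlap / sum(r.values())
    return 2 * p * rec / (p + rec) if p + rec else 0.0
```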
We use edit distance and exact match to check the similarity of the generated equations to the gold-standard equations in the math headlines; both metrics are widely used for evaluating equation generation [2, 23]. Edit distance quantifies how dissimilar two strings are by counting the minimum number of operations required to transform one string into the other. Over the $N$ samples in the test set, we use two variants. The first, Edit Distance(m), is a math-level dissimilarity score defined as

$$\mathrm{ED}(m) = \frac{1}{N} \sum_{i=1}^{N} \frac{D_i}{\max(g_i, h_i)},$$

where $D_i$ is the minimum edit distance between the equations in the $i$-th generated headline and the gold-standard headline, and $g_i$ and $h_i$ are the numbers of equations in the $i$-th generated headline and gold headline, respectively. The second, Edit Distance(s), is a sentence-level dissimilarity score, formulated as

$$\mathrm{ED}(s) = \frac{1}{N} \sum_{i=1}^{N} d_i,$$

where $d_i$ is the edit distance between the $i$-th generated headline and the gold-standard headline taken as whole sequences. Exact Match measures the exact-match accuracy between gold-standard math tokens and generated math tokens, calculated as

$$\mathrm{EM} = \sum_{i=1}^{N} |G_i \cap H_i|,$$

where $G_i$ and $H_i$ are the sets of math tokens in the $i$-th generated headline and gold-standard headline.
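The two building blocks of these metrics can be sketched directly: the classic edit-distance dynamic program and the set-intersection count for exact match. The per-sample aggregation and normalisation above would be layered on top.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions, and substitutions needed
    to turn sequence a into sequence b (single-row dynamic program)."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            # prev holds the diagonal cell; dp[j] the cell above;
            # dp[j-1] the already-updated cell to the left.
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[-1]

def exact_match(gen_math_tokens, gold_math_tokens):
    """Number of math tokens shared by the generated and gold headlines
    (set intersection, as in the Exact Match metric)."""
    return len(set(gen_math_tokens) & set(gold_math_tokens))
```

Both functions operate on token sequences, so they apply equally to whole headlines (sentence level) and to extracted equation strings (math level).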
Model comparisons can be found in Table 3. All models perform better on EXEQ-300k than on OFEQ-10k. A possible explanation is that EXEQ-300k contains a lower proportion of novel n-grams in its gold-standard math headlines (illustrated in Figure 2). Among the extractive models, Lead obtains good performance on EXEQ-300k, while TextRank performs well on OFEQ-10k; since OFEQ-10k contains more sentences per question, TextRank is more likely to pick out the accurate sentence. Unsurprisingly, abstractive models perform better than extractive models on both datasets. Compared to ordinary Seq2Seq, PtGen performs better, since it uses a copying strategy to copy tokens directly from the source question. Transformer outperforms PtGen, which implies that utilizing a multi-head attention mechanism yields better representation learning. MathSum significantly outperforms the other models on all evaluation metrics on both datasets. Thus, MathSum addresses some of the initial challenges of the NAME task and generates satisfactory headlines for questions.
In addition, we evaluate the gap between generated headlines and human-written headlines. The Edit Distance(m), Edit Distance(s), and Exact Match scores for the different models on EXEQ-300k and OFEQ-10k are shown in Table 4. The results show that the extractive models perform worse under Edit Distance(s) than under Edit Distance(m): since extractive models directly select sentences from the source questions, some selected sentences may not contain math equations at all. Among the abstractive baselines, Transformer obtains the best performance, reinforcing the claim that a multi-head attention mechanism can construct better representations for math equations. On EXEQ-300k, our model, MathSum, achieves the best performance on all metrics. On OFEQ-10k, MathSum achieves the best Exact Match and the second-best performance (slightly behind Transformer) on Edit Distance(m) and Edit Distance(s). A possible reason is that in OFEQ-10k the math equations in source questions are usually long, while those in headlines are often short; compared to Transformer, the copying mechanism may lead MathSum to copy long equations from the source questions, slightly degrading the Edit Distance(m) and Edit Distance(s) scores.
Jointly modeling quality
The heatmap in Figure 4 visualizes the attention weights from MathSum. Figure 4(a) compares the source detailed question with its human-written math headline and the headline generated by MathSum. As Figure 4 shows, the generated headline contains both textual tokens and math tokens, and both can be effectively aligned to their corresponding tokens in the source; for instance, the textual tokens "coordinate" and "triangle" and the corresponding math tokens are all successfully aligned.
To gain further insight into the generation quality of our method, we present three typical examples in Table 5. The first two are selected from EXEQ-300k (https://math.stackexchange.com/questions/2431575 and https://math.stackexchange.com/questions/752067) and the last one from OFEQ-10k (https://mathoverflow.net/questions/291434). The examples show that the generated headlines and the human-written headlines are comparable and similar. In general, the generated headlines are coherent, grammatical, and informative. We also observe that locating the main equations is important for the NAME task: if the generation method emphasizes a subordinate equation, it produces an unsatisfactory headline, as in the second example in Table 5.
Conclusions and Future Work
Here we define and explore NAME, a novel task of automatic headline generation for online math questions, using a new deep model, MathSum. Two new datasets (EXEQ-300k and OFEQ-10k) are constructed for model training and testing and are made publicly available. Our experimental results demonstrate that our model can often generate useful math headlines and significantly outperforms a series of state-of-the-art models. Future work could focus on enriched representations of math equations for mathematical information retrieval and other math-related research.
|Partial Math Detailed Question (EXEQ-300k)||So I am asked to find the inverse elements of this set (I know that this is the set of Gaussian integers). I was pretty much do…|
|Human-Written||finding the inverse elements of|
|MathSum||finding the inverse elements of|
|Partial Math Detailed Question (EXEQ-300k)||Suppose that the function is continuously differentiable. Define the function by…|
using the chain rule in
|Partial Math Detailed Question (OFEQ-10k)||In the paper of Herbert Clemens Curves on generic hypersurfaces the author shows that for a generic hypersurface of of sufficiently high degree there is no rational…|
|Human-Written||rational curves in and immersion|
|MathSum||rational curves in|
This work is partially supported by China Scholarship Council and projects of National Natural Science Foundation of China (No. 61876003 and 61573028), Guangdong Basic and Applied Basic Research Foundation (2019A1515010837) and Fundamental Research Funds for the Central Universities (18lgpy62), and the National Science Foundation.
- (2015) Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR.
- Image-to-markup generation with coarse-to-fine attention. In Proceedings of the 34th International Conference on Machine Learning, pp. 980–989.
- (2014) Meteor universal: language specific translation evaluation for any target language. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pp. 376–380.
- (2011) Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12, pp. 2121–2159.
- (2017) Preliminary exploration of formula embedding for mathematical information retrieval: can mathematical formulae be embedded like a natural language? arXiv preprint arXiv:1707.05154.
- Self-attentive model for headline generation. In European Conference on Information Retrieval, pp. 87–93.
- (2018) Mathematics content understanding for cyberlearning via formula evolution map. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 37–46.
- (2018) Equation embeddings. arXiv preprint arXiv:1803.09123.
- (2019) Pattern generation strategies for improving recognition of handwritten mathematical expressions. arXiv preprint arXiv:1901.06763.
- (2004) ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out, pp. 74–81.
- (2014) An interactive metadata model for structural, descriptive, and referential representation of scholarly output. Journal of the Association for Information Science and Technology 65 (5), pp. 964–983.
- (2004) TextRank: bringing order into text. In Proceedings of EMNLP 2004, pp. 404–411.
- Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In ACL.
- Learning to generate coherent summary with discriminative hidden semi-Markov model. In Proceedings of COLING 2014, pp. 1648–1659.
- (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318.
- (2016) Equation parsing: mapping sentences to grounded equations. In EMNLP.
- (2016) Semantification of identifiers in mathematics for better math information retrieval. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 135–144.
- (2017) Get to the point: summarization with pointer-generator networks. In ACL.
- (2017) Abstractive document summarization with a graph-based attentional neural model. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pp. 1171–1181.
- (2017) From neural sentence summarization to headline generation: a coarse-to-fine approach. In IJCAI, pp. 4109–4115.
- (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
- MathDQN: solving arithmetic word problems via deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence.
- (2018) Image-to-markup generation via paired adversarial learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 18–34.
- (2019) TopicEq: a joint topic and mathematical equation model for scientific texts. In AAAI.
- (2018) Formula ranking within an article. In Proceedings of the 18th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 123–126.
- (2016) A mathematical information retrieval system based on RankBoost. In Proceedings of the 16th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 259–260.
- (2016) Multi-stage math formula search: using appearance-based similarity metrics at scale. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 145–154.
- (2018) Question headline generation for news articles. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 617–626.
- (2019) An improved approach based on CNN-RNNs for mathematical expression recognition. In Proceedings of the 2019 4th International Conference on Multimedia Systems and Signal Processing, pp. 57–61.