TopicEq: A Joint Topic and Mathematical Equation Model for Scientific Texts

02/16/2019 · Michihiro Yasunaga, et al. · Yale University

Scientific documents rely on both mathematics and text to communicate ideas. Inspired by the topical correspondence between mathematical equations and word contexts observed in scientific texts, we propose a novel topic model that jointly generates mathematical equations and their surrounding text (TopicEq). Using an extension of the correlated topic model, the context is generated from a mixture of latent topics, and the equation is generated by an RNN that depends on the latent topic activations. To experiment with this model, we create a corpus of 400K equation-context pairs extracted from a range of scientific articles from arXiv, and fit the model using a variational autoencoder approach. Experimental results show that this joint model significantly outperforms existing topic models and equation models for scientific texts. Moreover, we qualitatively show that the model effectively captures the relationship between topics and mathematics, enabling novel applications such as topic-aware equation generation, equation topic inference, and topic-aware alignment of mathematical symbols and words.

Introduction

Technical scientific articles, such as those from physics and computer science, rely on both mathematics and text to communicate ideas. Most existing work in natural language processing (NLP) and machine learning studies these two components separately. For instance, text-based topic models have been used widely on scientific articles to uncover their semantic structure [Blei, Ng, and Jordan2003, Blei and Lafferty2006, Newman et al.2010a]. For mathematics, recent work [Lan et al.2015, Zanibbi et al.2016, Deng et al.2017] has studied methods to model and generate mathematical equations, for example using RNNs. However, ultimately these two components should be processed together in a seamless manner. Algorithms for automated understanding of scientific documents should extract the information encoded by not only words but also mathematical equations. At the same time, equations should ideally be modeled with the help of the surrounding text, as the meaning of an equation depends not only on its constituent symbols and syntax, but also on the context in which it appears [Wang et al.2015, Krstovski and Blei2018].

Figure 1: The words in a given technical context often characterize the distinctive types of equations used, and vice versa. Top topic: Relativity; bottom topic: Optimization.

To this end, this paper proposes a topic-equation model that jointly generates equations and their surrounding text in scientific documents (TopicEq), and demonstrates that the model can effectively achieve the aforementioned two goals. The intuition behind the model is illustrated in the sample passages in Figure 1, which shows how the topic of the word context is often indicative of the distinctive types of equations used, and vice versa. For instance, equations appearing in the topic of relativity (with context words like “black hole”, “Einstein”) tend to involve a series of tensors, while equations used in the topic of optimization (context words “gradient”, “optimal”) often use norms and minimization operators, and combinations of them. Ideally, the strings of mathematical symbols in the equations should aid the training of topic models, and the context words should aid the modeling and understanding of the equations.

Our model formalizes this intuition for scientific texts by generating each equation and its context passage using a shared latent topic. Specifically, we apply a topic model to the context passage, and use the same latent topic proportion vector in a recurrent neural network (RNN) to generate the equation as a sequence of symbols. To develop and experiment with this model, we construct a large corpus of context-equation pairs, extracted from the LaTeX source of arXiv articles across a range of scientific domains (ContextEq-400K). We fit the model on this corpus using approximate inference based on a variational autoencoder approach.

Our evaluation shows that this joint model significantly outperforms alternative topic models and RNN equation models for scientific texts. We further show that the model enables novel applications that bridge topics and mathematical equations. Concretely, the paper makes the following contributions.


  • The first study of jointly modeling topics and mathematics in scientific texts.

  • Better topic models for scientific texts: Joint training with the RNN equation model boosts the quality of topic modeling. This greatly outperforms the topic model that includes equations simply as bags of tokens, suggesting that equations’ syntax-level information captured by the RNN is useful for topic modeling.

  • Better equation models: Joint topic modeling provides the narrative context for equation prediction, and improves the quality/grammaticality of the RNN equation model.

  • Our model successfully captures the relationship between mathematical equations and topics (words), enabling interpretable handling of equations. For instance, we illustrate that the model enables topic-aware equation generation and equation topic inference. We also present a variant of this model that learns topic-aware associations between mathematical symbols and words.

  • The model is unsupervised, and enables the aforementioned tasks and applications without manual labels.

Related Work

Our work is connected to a wide range of recent research, from topic models to mathematical equation processing.

Topic models.

Topic models provide a powerful tool to extract the semantic structure of texts in the form of the latent topics—usually multinomial distributions over words. Starting from LDA [Blei, Ng, and Jordan2003], topic models have been studied extensively [Teh et al.2005, Blei and Lafferty2006, Blei and Lafferty2007, Hall, Jurafsky, and Manning2008], especially for scientific articles. However, while mathematical equations play an essential role in scientific documents, topic models capable of processing equations besides word texts are yet to be studied. This work shows that incorporating joint modeling of equations via an RNN boosts the performance of topic modeling for scientific texts.

Recent work [Cao et al.2015, Larochelle and Lauly2012] has proposed neural topic models, leveraging the flexibility and representation power of neural networks. In particular, [Miao, Yu, and Blunsom2016, Miao, Grefenstette, and Blunsom2017, Srivastava and Sutton2017] employ neural variational inference to train topic models; we will apply their technique to fit our model.

Language models & equation models.

Language modeling aims to learn a probability distribution over a sequence of words. It is a fundamental task in NLP, with a plethora of applications including text generation. RNN-based language models have been shown effective for sequences with long-term dependencies [Mikolov et al.2010, Jozefowicz et al.2016].

Similar to language models, equation models are useful for various tasks involving equation generation, such as semantic parsing [Roy, Upadhyay, and Roth2016] and handwriting / optical character recognition [Deng et al.2017]. The use of RNNs to model LaTeX was illustrated by [Karpathy2015] for an algebraic geometry text. This work also employs an RNN to model each equation as a sequence of LaTeX tokens (or “symbols,” interchangeably).

Neural topic-language models.

Our model architecture is motivated by joint topic-language models. Such models typically extract latent topics of a given document via a topic model, and utilize the topic knowledge to improve an RNN language model. Mikolov and Zweig [Mikolov and Zweig2012] incorporate the topic vector of a pre-trained LDA model into an RNN language model; recent work [Dieng et al.2017, Lau, Baldwin, and Cohn2017, Wang et al.2018] trains neural topic and language models jointly, as we will do here.

Key distinctions can be made between our work and these models. First, while previous work uses topic models to improve language modeling on the same word text, our task models two different modalities: word text and equations. In this sense, our work is related to [Blei and Jordan2003], which extends LDA to model image-text pairs. Moreover, taking advantage of these two modalities, we also present a variant of the TopicEq model that learns topic-aware association between mathematical symbols and words.

The second difference lies in the RNN equation model we propose. While [Dieng et al.2017, Ahn et al.2016, Lau, Baldwin, and Cohn2017] integrate the topic knowledge into either the output layer of the LSTM or the word predictions of the language model, we embed the topic proportion vector inside the LSTM, to enable the topic knowledge to have deeper influence on equation generation. Experimental results show that this method of incorporating topic information is more effective than the existing methods for improving the quality of equation modeling.

Mathematical equation processing.

Some work has processed equations as bags of math symbols to extract their features for searching [Sojka and Líška2011] and clustering [Lan et al.2015]. Zanibbi et al. [Zanibbi et al.2016] introduce tree-based representations of equations for mathematical information retrieval tasks. Most recently, Deng et al. [Deng et al.2017] propose RNN-based models to generate equations. We will show that RNN-based equation processing can capture syntactic features of equations, and provides more effective help for topic modeling than bag-of-tokens equation processing does.

Finally, our approach of modeling equations together with their contexts is related to [Krstovski and Blei2018], which fits equation embeddings using surrounding words. While they limit the equation domains (i.e., ML and AI), this work aims to uncover topics for texts and equations from a range of scientific domains. This work also models each equation itself as a sequence of symbols, which is not studied in their work.

The TopicEq Model

Our starting point is the correlated topic model [Blei and Lafferty2007], which models the topic proportion vector through a latent Gaussian vector. We extend this model to the setting where each “document” consists of a displayed equation eq and its surrounding text $w_{1:N}$, which we call the equation’s context. Our joint model assumes that each equation and its context are generated from the same latent topic proportion vector $\theta$; see Figure 2. Concretely, the generative process for a given $(w_{1:N}, \textit{eq})$ is

$h \sim \mathcal{N}(\mu, \Sigma), \quad \theta = \mathrm{softmax}(h)$   (1)
$w_i \sim \mathrm{Categorical}(\beta\theta), \quad i = 1, \dots, N$   (2)
$\textit{eq} \sim \text{TE-LSTM}(\theta)$   (3)

where $\mathrm{softmax}(h)_k = \exp(h_k) / \sum_{k'} \exp(h_{k'})$. Note that this is equivalent to placing a logistic normal distribution on $\theta$, where the latent Gaussian $h$ has mean $\mu$ and covariance $\Sigma$. The parameters $\mu, \Sigma$, the topics $\beta = (\beta_1, \dots, \beta_K)$, and the weights in the LSTM are to be estimated from data. Expressing the model as shown in Figure 2 emphasizes the connection with neural topic models such as [Miao, Grefenstette, and Blunsom2017]; we will apply their model training technique.

Figure 2: Graphical structure underlying the TopicEq model.

Both the words and the equation are generated in a way that depends on the topic proportion vector $\theta$. The topics $\beta_1, \dots, \beta_K$ are distributions over a word vocabulary with $V$ words; the context words are then drawn from the mixture $\beta\theta$, similar to [Wang et al.2018]. We employ an RNN to generate eq as a sequence of mathematical tokens, where the vocabulary is extracted from the set of LaTeX tokens. Specifically, to generate an equation conditioned on the latent topic proportion vector $\theta$ (equivalently $h$), we consider a Topic-Embedded LSTM (TE-LSTM), an extension of the LSTM [Hochreiter and Schmidhuber1997] where the $t$-th update is

$z_t = [x_t; s_{t-1}; \theta]$
$i_t = \sigma(W_i z_t + b_i), \quad f_t = \sigma(W_f z_t + b_f), \quad o_t = \sigma(W_o z_t + b_o)$
$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c z_t + b_c)$
$s_t = o_t \odot \tanh(c_t).$

Here $z_t$ denotes the concatenation of the current input, previous state, and topic proportion vector; $\sigma$ is the sigmoid function and $\odot$ denotes the Hadamard product. The probability of the next token in the equation is $p(y_{t+1} \mid y_{1:t}, \theta) = \mathrm{softmax}(W s_t + b)$. Thus, the TE-LSTM embeds $\theta$ inside the LSTM cell to reflect the topic knowledge for equation generation. As a joint topic-equation model, it is similar to the topic-language model of [Wang et al.2018].
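For reference, a minimal sketch of one TE-LSTM step in PyTorch, using a single-layer cell with a fused gate projection (the full model described later uses two layers and a separate output projection; this is a simplified illustration, not the authors' implementation):

```python
import torch
import torch.nn as nn

class TELSTMCell(nn.Module):
    """One step of the Topic-Embedded LSTM: the topic proportion vector
    theta is concatenated with the input and previous state before the
    gates are computed."""
    def __init__(self, input_size, hidden_size, num_topics):
        super().__init__()
        cat_size = input_size + hidden_size + num_topics
        self.gates = nn.Linear(cat_size, 4 * hidden_size)  # i, f, o, g
        self.hidden_size = hidden_size

    def forward(self, x_t, s_prev, c_prev, theta):
        z = torch.cat([x_t, s_prev, theta], dim=-1)
        i, f, o, g = self.gates(z).chunk(4, dim=-1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c_t = f * c_prev + i * torch.tanh(g)   # Hadamard products
        s_t = o * torch.tanh(c_t)
        return s_t, c_t
```

The next-token distribution is then obtained from a softmax over a linear projection of the new state $s_t$.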

Writing the equation as a sequence of tokens $\textit{eq} = (y_1, \dots, y_T)$, the training objective is the marginal likelihood of $w_{1:N}$ and eq:

$p(w_{1:N}, \textit{eq}) = \int p(h) \Big[\prod_{i=1}^{N} p(w_i \mid \theta)\Big] \Big[\prod_{t=1}^{T} p(y_t \mid y_{<t}, \theta)\Big] \, dh, \qquad \theta = \mathrm{softmax}(h).$   (4)

Since its direct optimization is intractable, we employ variational inference [Jordan et al.1999]. Denoting the variational distribution by $q(h \mid w_{1:N})$, we maximize the variational lower bound (ELBO) for the log-likelihood, $\log p(w_{1:N}, \textit{eq})$:

$\mathcal{L} = \mathbb{E}_{q(h \mid w_{1:N})}\Big[\sum_{i=1}^{N} \log p(w_i \mid \theta) + \sum_{t=1}^{T} \log p(y_t \mid y_{<t}, \theta)\Big] - \mathrm{KL}\big(q(h \mid w_{1:N}) \,\|\, p(h)\big).$   (5)

Following recent approaches to neural topic-language models [Miao, Grefenstette, and Blunsom2017, Dieng et al.2017, Wang et al.2018], we compute $q(h \mid w_{1:N})$ as a function of the context using the variational autoencoder technique [Kingma and Welling2014]. Specifically, we use a feed-forward neural network (FFNN) as an inference network to parameterize the mean and variance vectors of the (diagonal) Gaussian variational distribution $q(h \mid w_{1:N})$. We then use samples from $q$ to optimize Eq 5. The parameters of the inference network, the topic model, and the equation model are jointly trained by stochastic gradient descent.

We also add a topic diversity regularization term to Eq 5, following [Xie, Deng, and Xing2015]. We observed that this technique prevents learning generic, redundant topics.
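A minimal sketch of the amortized inference step, assuming a bag-of-words encoding of the context and, for brevity, a standard-normal prior on $h$ in the KL term (the full model uses the logistic-normal prior with mean $\mu$ and covariance $\Sigma$; `eq_log_likelihood_fn` is a hypothetical helper that scores the equation under the TE-LSTM):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InferenceNet(nn.Module):
    """q(h | context): a feed-forward encoder producing the mean and
    log-variance of a diagonal Gaussian over the latent h."""
    def __init__(self, vocab_size, num_topics, hidden=300):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(vocab_size, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, num_topics)
        self.logvar = nn.Linear(hidden, num_topics)

    def forward(self, bow):
        e = self.enc(bow)
        return self.mu(e), self.logvar(e)

def elbo_step(inference_net, bow, log_beta, eq_log_likelihood_fn):
    """One reparameterized sample of the ELBO (to be maximized)."""
    mu, logvar = inference_net(bow)
    h = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
    theta = F.softmax(h, dim=-1)
    # context term: sum_w count(w) * log(sum_k theta_k * beta_{k,w})
    log_word_prob = torch.log(theta @ log_beta.exp() + 1e-10)
    rec_words = (bow * log_word_prob).sum(-1)
    rec_eq = eq_log_likelihood_fn(theta)                  # TE-LSTM term
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1)
    return (rec_words + rec_eq - kl).mean()
```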

Experiments

We study the performance of the proposed model on a corpus of context-equation pairs constructed from arXiv articles. We quantitatively show that our joint topic-equation model provides a better fit than alternative topic models and equation models. We further demonstrate its efficacy through qualitative analyses and novel applications, such as equation generation and equation topic inference.

Dataset Construction (ContextEq-400K)

To obtain a dataset of context-equation pairs, we used scientific articles published on arXiv.org. We sampled 100k articles from all domains in the past 5 years, and split them into train, validation and test sets (80%, 10%, 10%). For each article, we parsed its LaTeX source and extracted single-line display equations that have five consecutive sentences both before and after the equation, which are used to define the word context. Following [Deng et al.2017], we further tokenized each equation into a sequence of LaTeX tokens (e.g., \sigma, ^, {, 2, }) and kept those of length 20–150, yielding the final corpus of 400K equation-context pairs. An equation has 63 tokens on average. The context size of 10 sentences is similar to the document size used in recent work on topic-language models [Dieng et al.2017, Wang et al.2018].
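A sketch of the pair-extraction and length filter as we read this procedure (the tokenizer regular expression is an illustrative stand-in, not the exact preprocessing script used for ContextEq-400K):

```python
import re

# Illustrative: split a LaTeX equation body into tokens such as
# \sigma, ^, {, 2, } (commands, braces, and single characters).
TOKEN_RE = re.compile(r"\\[A-Za-z]+|\S")

def tokenize_equation(latex_body):
    return TOKEN_RE.findall(latex_body)

def keep_pair(eq_tokens, sents_before, sents_after,
              min_len=20, max_len=150, context_sents=5):
    """Keep a pair only if the equation length is in range and there are
    five sentences of context on each side."""
    return (min_len <= len(eq_tokens) <= max_len
            and len(sents_before) >= context_sents
            and len(sents_after) >= context_sents)
```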

Experimental Setup

We fit the TopicEq model end-to-end on the train set and evaluate its performance on the test set.

Preprocessing.

For the topic modeling of context passages, we first removed all the inline math expressions in the text. We then followed the preprocessing steps in [Wang et al.2018] to tokenize and lowercase all words, exclude stopwords and words appearing in fewer than 100 documents; this resulted in a vocabulary size of 8,660. For equations, we use the 1,000 most frequent LaTeX tokens as our vocabulary.

 

Topic Model (# Topics)                50     100
LDA (context only)                   .085   .083
Ours (context only)                  .085   .084
Ours (context + Eq BOW)              .087   .086
Ours (context + Eq LSTM)             .097   .094
Ours (context + Eq LSTM shuffled)    .086   .085

 

Table 1: Topic coherence of different topic models, evaluated on the held-out arXiv data. Our full TopicEq model is shown as “Ours (context + Eq LSTM).”

 

Quantum physics:    spin energy field electron magnetic state states hamiltonian
Particle physics:   higgs neutrino coupling decay scale masses mixing quark
Astrophysics:       mass gas star stellar galaxies disk halo radius luminosity
Relativity:         black metric hole schwarzschild gravity holes einstein
Number theory:      prime integer numbers conjecture integers degree modulo
Graph theory:       graph vertex vertices edges node edge number set tree
Linear algebra:     matrix matrices vector basis vectors diagonal rank linear
Optimization:       problem optimization algorithm function solution gradient
Probability:        random probability distribution process measure time
Machine learning:   layer word image feature sentence model cnn lstm training

 

Table 2: Topics learned by the TopicEq model. Left: topic name (summarized by us). Right: top words in topic.

Model setting.

For the inference network $q$, we use a 2-layer FFNN with 300 units, similar to [Miao, Yu, and Blunsom2016, Miao, Grefenstette, and Blunsom2017]. The equation TE-LSTM architecture has two layers and state size 500, with dropout rate 0.5 applied to each layer [Srivastava et al.2014]. The parameters of the TopicEq model are jointly optimized by Adam [Kingma and Ba2015], with batch size 200, learning rate 0.002, and gradient clipping 1.0 [Pascanu, Mikolov, and Bengio2012].

Topic Model Evaluation

We first study the topic modeling performance of TopicEq by evaluating the coherence of the learned topics [Chang et al.2009, Newman et al.2010b, Mimno et al.2011]. Specifically, following [Lau, Newman, and Baldwin2014], we compute the normalized PMI (NPMI) metric on the held-out test set (a minimal sketch of this computation follows the list below). As our TopicEq model incorporates a joint, RNN-based equation model, to analyze its effect we compare the full TopicEq model with the following baseline topic models:


  • LDA (context only): we apply LDA to the word text

  • Ours (context only): TopicEq without the equation model

  • Ours (context + Eq BOW): TopicEq’s joint LSTM equation model (Eq 3) is replaced by a baseline bag-of-tokens model similar to that for context words.
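The NPMI coherence metric referenced above can be computed roughly as follows; this is a simplified reference sketch using document-level co-occurrence on held-out text, not the exact evaluation script:

```python
import math
from itertools import combinations

def npmi_coherence(top_words_per_topic, documents):
    """Average NPMI over word pairs in each topic's top-word list,
    using document-level co-occurrence counts on held-out documents."""
    docs = [set(d) for d in documents]
    n_docs = len(docs)

    def p(*words):
        return sum(all(w in d for w in words) for d in docs) / n_docs

    topic_scores = []
    for top_words in top_words_per_topic:
        pair_scores = []
        for w1, w2 in combinations(top_words, 2):
            p1, p2, p12 = p(w1), p(w2), p(w1, w2)
            if 0 < p12 < 1.0:
                pair_scores.append(math.log(p12 / (p1 * p2)) / (-math.log(p12)))
        topic_scores.append(sum(pair_scores) / max(len(pair_scores), 1))
    return sum(topic_scores) / len(topic_scores)
```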

The evaluation results are summarized in Table 1. The full TopicEq model is shown as “Ours (context + Eq LSTM)” in the table. We observe that TopicEq’s topic model component (2nd row) performs on a par with LDA (1st row), but it achieves a significant boost (+0.01) when trained together with the LSTM equation model (4th row). Adding equations as bags of tokens (3rd row) does improve the topic model marginally (+0.002), but the improvement from the joint LSTM equation model is 5 times greater. These results show that a joint RNN equation model provides significant information to aid topic modeling of scientific texts.

 

Equation Model                       Perplexity        Syntax Error (%)
(# Topics)                           50      100       100

No joint training
  LSTM (no topic)                    5.81    5.81      15.3
  LSTM + LDA                         5.54    5.52      13.4

Joint training with topic model
  TD-LSTM (Lau et al. 2017)          5.44    5.41      12.5
  TE-LSTM (Ours)                     5.36    5.34      11.7

 

Table 3: Performance of different equation models, evaluated on held-out arXiv data. We report the perplexity metric (for # topics 50, 100 if topic info is used), and the syntax error rate of generated LaTeX equations (for # topics 100).

Why is the RNN helpful?

We hypothesize that one reason why the joint RNN equation model is more helpful than the bag-of-tokens equation model is that the RNN also captures syntax-level information in equations. But one might argue that the introduction of the RNN itself was useful for topic modeling (e.g., as a form of regularization). To test our hypothesis, we re-trained TopicEq with each equation’s token order randomly shuffled in the training data, thus corrupting the syntactic information of each equation. The result is shown in Table 1 as “Ours (context + Eq LSTM shuffled).” This time, the topic model performance degrades severely and falls to the level of the baseline topic model, “Ours (context only)”. This result supports the claim that the original TopicEq’s joint RNN actually captured syntactic features of equations, providing more effective help for topic modeling than a bag-of-tokens equation model does.
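The shuffling ablation amounts to permuting each training equation's token sequence while leaving the context text untouched; a small sketch (the pair dictionary field name is hypothetical):

```python
import random

def shuffle_equation_tokens(pairs, seed=0):
    """Corrupt syntactic structure: randomly permute the LaTeX tokens of
    every training equation; the context passages are not modified."""
    rng = random.Random(seed)
    for pair in pairs:
        rng.shuffle(pair["eq_tokens"])  # in-place permutation
    return pairs
```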

This idea also makes intuitive sense. Mathematical equations use a much smaller vocabulary (symbols / variables) than word texts, and thus often need phrase- or syntax-level information to aid topic modeling. For example, in the equations in Figure 1, phrases such as the super/sub-script pattern of a tensor or a regularization term provide rich information to identify the topics (relativity and optimization), while the corresponding unordered bags of tokens do not provide as much help.

Learned topics.

To visualize the topic modeling performance, we sampled 10 topics learned by TopicEq (Table 2). They intuitively reflect the scientific topics of arXiv articles.

Table 4: The TopicEq model generates equations that reflect the characteristics of given topics. Left: topic (picked from Table 2). Right: equations generated by the model conditioned on the given topic (one-hot topic vector $\theta$).

Equation Model Evaluation

Next, we evaluate the equation model component of TopicEq by measuring the test set perplexity. Additionally, as the grammaticality of equations can be measured using the LaTeX compiler, we also evaluate the syntax error rate of generated equations. We compare our TE-LSTM with


  • a generic LSTM (no topic knowledge)

  • LSTM + LDA: the topic vector obtained from a pre-trained LDA model is concatenated to the output of the LSTM

  • TD-LSTM [Lau, Baldwin, and Cohn2017]: a recent topic-dependent LSTM, applied to our task.

TD-LSTM and our TE-LSTM are jointly trained with our topic model component. As Table 3 shows, all the topic-dependent LSTMs are superior to the vanilla LSTM in both the perplexity metric and the syntax error metric. Moreover, our TE-LSTM outperforms TD-LSTM, suggesting that the model better incorporates topic knowledge by embedding $\theta$ inside the LSTM. We also find that, compared to [Wang et al.2018]’s Mixture-of-Experts LSTM, our model achieves similar performance in this task while requiring fewer parameters and much less training time (40% reduction). In total, compared to the generic LSTM, our TE-LSTM equation model reduces test perplexity by 8% (relative) and syntax error rate by 3.5% (absolute). This result suggests that incorporating context/topic information can improve the quality and grammaticality of equation modeling.
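The syntax error rate above can be estimated by compiling each generated equation inside a minimal LaTeX document; a sketch (the document template, compiler flags, and timeout are our assumptions, not the exact evaluation setup):

```python
import os
import subprocess
import tempfile

TEMPLATE = ("\\documentclass{article}\\usepackage{amsmath,amssymb}"
            "\\begin{document}\\begin{equation*}%s\\end{equation*}"
            "\\end{document}")

def compiles(equation_latex, timeout=10):
    """Return True if the equation compiles without a LaTeX error."""
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "eq.tex"), "w") as f:
            f.write(TEMPLATE % equation_latex)
        result = subprocess.run(
            ["pdflatex", "-interaction=nonstopmode", "-halt-on-error", "eq.tex"],
            cwd=tmp, capture_output=True, timeout=timeout)
        return result.returncode == 0

def syntax_error_rate(equations):
    return sum(not compiles(eq) for eq in equations) / len(equations)
```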

Qualitative Analysis & Applications

Table 5: We let the TopicEq model greedily generate equations while smoothly changing $\theta$ between two topics (via linear interpolation). Left: given topic pair and its interpolation. Right: generated equation (for each topic pair, the model generates from a fixed initial token).
Table 6: Given a set of context words picked from an article abstract (1st column), we let TopicEq infer its topic proportions (2nd column) and generate equations (3rd column).

 

             

#    Given Equation                Inferred Topic (top 5 words)

#1   [[Schrödinger Equation]]
       by our TopicEq:             hamiltonian, spin, particle, interaction, wave
       by bag-of-token baseline:   time, operator, space, hamiltonian, system

#2   [[Newton’s 2nd Law of Motion]]
       by our TopicEq:             velocity, particle, pressure, motion, force
       by bag-of-token baseline:   time, velocity, particle, diffusion, force

#3   [[Potential energy & Work]]
       by our TopicEq:             direction, force, surface, strain, stress  (?)
       by bag-of-token baseline:   method, order, solution, numerical, problem  (vague)

#4   [[LSTM]]
       by our TopicEq:             layer, word, image, feature, network
       by bag-of-token baseline:   function, section, problem, condition, solution  (vague)

#5
       by our TopicEq:             random, variable, probability, distribution, entropy
       by bag-of-token baseline:   probability, random, theorem, variable, distribution

#6
       by our TopicEq:             measure, random, process, gaussian, convergence
       by bag-of-token baseline:   probability, random, theorem, variable, distribution

#7   [[Taylor Expansion]]
       by our TopicEq:             coefficients, series, expansion, fourier, polynomial
       by bag-of-token baseline:   polynomial, series, function, convergence, order

#7'  [[Taylor Expansion]] (variable names changed)
       by our TopicEq:             coefficients, series, expansion, fourier, polynomial
       by bag-of-token baseline:   function, integral, equation, point, solution  (fooled)
 

Table 7: The TopicEq model can infer the appropriate topic for equations from various domains, with better precision and consistency than the bag-of-token baseline. Left: given equation ([[ ]] shows the correct formula name for readers). Right: topic inferred by our model and by the baseline; parenthetical notes mark vague or incorrect inferences. We verified that the exact same equations did not appear in the training data.

Topic-aware Equation Generation

The TopicEq model can generate meaningful equations from specified topics, using Eq 3 (TE-LSTM). For example, given a topic $k$, we let $\theta$ be the one-hot vector representing the topic; conditioned on $\theta$, and starting from the <START> token, we keep sampling the next LaTeX token until the <END> token is generated. Table 4 shows several topics picked from Table 2 (left), and equations generated from each of these topics (right). We see that the artificial equations generated by the model clearly reflect the distinctive characteristics of the given topics. For instance, derivatives and numbers with units are generally used for physics; electron configurations for quantum physics; series of tensors for relativity; prime numbers for number theory; and probability clauses for probability. We also note that the equations generated by our TE-LSTM use not only topic-specific symbols but also topic-specific phrases and syntax (e.g., a set definition is used for linear algebra; a “subject to” clause for optimization). These qualitative results support that TopicEq is capable of fully incorporating topic information for equation modeling.
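A sketch of this topic-conditioned sampling with a TE-LSTM cell like the one sketched earlier (`embed`, `proj`, and the special-token ids are hypothetical names for the token embedding, output projection, and <START>/<END> ids):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate_equation(cell, embed, proj, topic_id, num_topics,
                      start_id, end_id, max_len=150, temperature=1.0):
    """Sample LaTeX tokens from the TE-LSTM conditioned on a one-hot topic."""
    theta = F.one_hot(torch.tensor(topic_id), num_topics).float().unsqueeze(0)
    s = torch.zeros(1, cell.hidden_size)
    c = torch.zeros(1, cell.hidden_size)
    token, output = start_id, []
    for _ in range(max_len):
        x = embed(torch.tensor([token]))
        s, c = cell(x, s, c, theta)
        probs = F.softmax(proj(s) / temperature, dim=-1)
        token = torch.multinomial(probs, 1).item()
        if token == end_id:
            break
        output.append(token)
    return output
```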

Mixtures of topics.

The model can also generate equations from a mixture of topics by setting $\theta$ accordingly. To qualitatively analyze the space of the topic vector in terms of equation generation, we let the model generate equations while smoothly changing $\theta$ between two topics (i.e., one-hot vectors $\theta_1$ and $\theta_2$) via linear interpolation: $\theta = (1-\lambda)\theta_1 + \lambda\theta_2$ for $\lambda \in [0, 1]$. In Table 5, for two examples we show the given topic pair and its interpolation (left), and the equation greedily decoded from each (right). We fix the initial token of the generated equations within each example (astrophysics and graph theory for the first example; optimization and statistics for the second). In both cases we observe that the generated equations make a smooth transition from one topic to the other; e.g., for the first example, from astrophysical quantities to linear algebraic terms, and finally set notation (graph theory). In the second example, where the two topics optimization and statistics are closely related, the generated equations make a very intuitive transition: from an optimization objective with norms and regularization terms (top), to summation terms (middle), and finally expectations (bottom; statistics topic). These observations support that TopicEq learns smooth representations for the latent topic vector (especially for a mixture of closely related topics) with respect to equation generation.

Finally, we illustrate that the model can generate equations from a given set of context words. Specifically, we let the model infer the topic proportion $\theta$ of the context words via the inference network, and then generate equations from $\theta$ via Eq 3 (TE-LSTM). As Table 6 shows, the model is able to infer the right topic mixture (2nd column) and generate equations that reflect those topics (e.g., solar mass and radius terms are used for the top example; a loss function and related terms for the bottom example).

Equation Topic Inference

Identifying the topic of equations is an important task that allows readers to obtain semantic descriptions of equations unfamiliar to them. However, while some work [Schubotz et al.2016, Stathopoulos et al.2018] has studied the task of identifying the meaning of individual mathematical symbols, no prior work has succeeded in providing descriptions for entire equations from various domains.

Our TopicEq model can be utilized to identify the topic of given equations. Specifically, with a trained TopicEq model, for a given equation eq we find the topic $k$ (so $\theta$ is the one-hot vector for topic $k$) that maximizes the likelihood in Eq 3, which is parametrized by our topic-dependent LSTM. Table 7 shows examples of equations across different domains (1st column), and the most likely topic inferred by our model for each equation (2nd column). We used the learned topics as candidates in this task. We observe that the TopicEq model correctly identifies the domains or even finer topics (e.g., note the distinction between #5 and #6) for most of the given equations.
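A sketch of this inference procedure (`eq_log_likelihood` is a hypothetical helper that scores the token sequence under the TE-LSTM for a given topic vector):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def infer_equation_topic(eq_tokens, num_topics, eq_log_likelihood):
    """Return the topic index k maximizing log p(eq | theta = one_hot(k))."""
    scores = []
    for k in range(num_topics):
        theta = F.one_hot(torch.tensor(k), num_topics).float().unsqueeze(0)
        scores.append(float(eq_log_likelihood(eq_tokens, theta)))
    return max(range(num_topics), key=lambda k: scores[k])
```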

 

Math symbol: E
  No topic:         energy, expectation, elliptic curve
  Probability:      expectation, expected value
  Quantum physics:  electric field, energy
  Graph theory:     edge, spectral sequence

Math symbol: M
  No topic:         mass, matrix
  Probability:      martingale, maximum
  Quantum physics:  magnetic moment, mass
  Graph theory:     matroid, matching

Math symbol: P
  No topic:         polynomials, momentum, probability
  Probability:      probability, poisson, distribution
  Quantum physics:  momentum, proton, pressure
  Graph theory:     path, perimeter, probability

Math symbol: T
  No topic:         temperature, transpose, transfer matrix
  Probability:      stopping time, test statistic
  Quantum physics:  temperature, thermal conductivity
  Graph theory:     tree, trees, triangulation

Math symbol: V
  No topic:         potential, voltage, visibility, volume
  Probability:      variance, volatility
  Quantum physics:  voltage, potential energy
  Graph theory:     vertex, volume, SVD

Math symbol: σ
  No topic:         conductivity, variance, normal distribution
  Probability:      standard deviation, normal distribution
  Quantum physics:  conductivity, pauli matrices
  Graph theory:     permutation, simplex

Math symbol: | |
  No topic:         norm, distance, conditional
  Probability:      conditional probability
  Quantum physics:  absolute value
  Graph theory:     triangle inequality, cardinality

 

Table 8: Top word phrases predicted by our topic-aware alignment model for each math symbol, shown for three of the learned topics (Probability, Quantum physics, Graph theory) as well as the non-topic baseline.

Is an RNN necessary for this task?

We repeated this experiment using a bag-of-tokens model for equations in Eq 3 (instead of the LSTM), to analyze whether the RNN equation model provides an advantage over the bag-of-tokens approach in this task. As can be seen in Table 7, this bag-of-tokens baseline performs as well on #1 and #2, which contain topic-specific variables, but fails on #3 and #4, which consist of a relatively generic set of symbols and require recognizing phrases (e.g., the expression for work, or a neural network layer) to identify the correct topic. Indeed, the topics predicted for #3 and #4 are very generic and similar. Similarly, the bag-of-tokens baseline fails to distinguish #5 and #6, most likely because it does not recognize the phrase- and syntax-level differences between these two equations. Finally, for #7 (Taylor Expansion), we also experimented with #7', where we just changed some variable names without altering the equation's meaning and syntax. While our TopicEq still recognizes this to be the same topic as #7, the bag-of-tokens baseline is fooled by the changed variable names and predicts a wrong topic. These observations suggest that the RNN equation model can capture phrase- and syntax-level information, and can consistently infer the correct topics for equations from various domains. The TopicEq model could thus be used to help readers interpret equations unfamiliar to them.

Extension: Topic-aware alignment between mathematical tokens and words

Mathematical symbols (including variables) carry different meanings in different contexts or topics. Prior work [Pagael and Schubotz2014, Schubotz et al.2016, Stathopoulos et al.2018] has studied the task of identifying the meanings of math variables using surrounding words, but their topic dependence has not been modeled explicitly. Here we present a variant of the TopicEq model that captures the topic-dependent alignment between mathematical tokens and words from scientific document data. Specifically, we aim to learn the most probable descriptions (word phrases) $w$ associated with a given math symbol $s$, under a given topic or topic mixture $\theta$: $p(w \mid s, \theta)$.

Baseline alignment model.

We use the equations and context texts from our ContextEq corpus. Similar to [Pagael and Schubotz2014], we consider that the descriptions of math symbols often appear in the sentence immediately before or after the given equation (the immediate context). We then consider a simple alignment model between symbols in the equation and phrases in the immediate context, such that

$p(w \mid \textit{eq}) = \mathrm{softmax}(A e).$   (6)

Here the vector $e$ is the bag-of-tokens representation of the equation, and $A \in \mathbb{R}^{V_w \times V_s}$ is the alignment matrix we estimate from the data by maximizing the likelihood of the observed description phrases. $V_s$ and $V_w$ are the vocab sizes of symbols and word descriptions. For the vocabulary of word descriptions, we collect the titles of Wikipedia pages that contain mathematical equations, and then use the top 2,000 phrases that appear in our arXiv dataset. For math symbols, we use the most frequent LaTeX tokens.

To predict descriptions for a single symbol $s$, we set $e$ to be the one-hot vector representing $s$, as a surrogate.
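A minimal sketch of this baseline, assuming the single-matrix softmax form in Eq 6 (training loop omitted; names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BaselineAlignment(nn.Module):
    """p(description | eq) = softmax(A e), where e is the bag-of-tokens
    vector of the equation and A is the alignment matrix."""
    def __init__(self, num_symbols, num_phrases):
        super().__init__()
        self.A = nn.Linear(num_symbols, num_phrases, bias=False)

    def forward(self, e):                       # e: (batch, num_symbols)
        return F.log_softmax(self.A(e), dim=-1)

    def predict_for_symbol(self, symbol_id, num_symbols, top=5):
        """Surrogate query: a one-hot e for a single symbol."""
        e = torch.zeros(1, num_symbols)
        e[0, symbol_id] = 1.0
        return self.forward(e).topk(top, dim=-1).indices
```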

Topic-aware alignment model.

To model $p(w \mid \textit{eq}, \theta)$, we want the alignment matrix to depend on $\theta$. Motivated by the tensor factorization method in [Song, Gan, and Carin2016], we let

$A(\theta) = W_a \,\mathrm{diag}(W_b \theta)\, W_c,$   (7)

where $W_a \in \mathbb{R}^{V_w \times F}$, $W_b \in \mathbb{R}^{F \times K}$, and $W_c \in \mathbb{R}^{F \times V_s}$ are parameters to estimate. $F$ is the number of factors, which we set equal to the number of topics $K$. To jointly perform topic modeling and alignment learning, we consider a variant of TopicEq where we simply replace Eq 3 by this topic-dependent alignment model, and train it on the ContextEq corpus.
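A sketch of the topic-aware variant under the factorization above (the exact factor form is our reading of Eq 7; initialization scales are arbitrary choices):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopicAwareAlignment(nn.Module):
    """p(description | eq, theta) = softmax(A(theta) e), with
    A(theta) = W_a diag(W_b theta) W_c, so the alignment matrix
    varies with the topic mixture theta."""
    def __init__(self, num_symbols, num_phrases, num_topics, num_factors=None):
        super().__init__()
        f = num_factors or num_topics          # F = K by default
        self.W_a = nn.Parameter(torch.randn(num_phrases, f) * 0.01)
        self.W_b = nn.Parameter(torch.randn(f, num_topics) * 0.01)
        self.W_c = nn.Parameter(torch.randn(f, num_symbols) * 0.01)

    def forward(self, e, theta):               # e: (B, V_s), theta: (B, K)
        gate = theta @ self.W_b.t()            # (B, F): diag(W_b theta)
        factors = (e @ self.W_c.t()) * gate    # (B, F): elementwise gating
        logits = factors @ self.W_a.t()        # (B, V_w)
        return F.log_softmax(logits, dim=-1)
```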

Results and Discussion

Table 9 shows the perplexity of the baseline / topic-aware alignment models evaluated on the held-out test set. We observe that the topic information significantly improves the alignment between math symbols and word descriptions, reducing the perplexity by more than 33% (relative).

 

Alignment Model (# Topics)    50     100
Baseline (no topic)           602    602
Topic-Aware                   406    387

 

Table 9: Test perplexity for phrase prediction.

 

Topic Model (# Topics)          50     100
Context Only                   .085   .084
with joint Alignment Model     .088   .087

 

Table 10: Topic coherence evaluation for each topic model.

Qualitative results.   Table 8 shows the actual top phrases predicted by the alignment models for several math symbols that are used in a wide range of domains. The proposed TopicEq variant indeed learns topic-dependent alignments between symbols and words. For instance, it associates E with “expectation” for the probability topic, “electric field” for quantum physics, and “edge” for graph theory, which makes intuitive sense. On the other hand, the baseline (no topic) model associates E with “energy”, which is simply the description that appears most frequently across all articles. This is another example of how the TopicEq framework can be used to capture the relationship between topics and mathematics.

Utility.

We also note that our topic-aware alignment model can be conditioned on a mixture of topics by setting $\theta$ accordingly. Given a context text and an equation, the model can infer the topic proportion $\theta$ with the topic model component, and then use the topic-aware alignment component to infer the most probable meaning of each variable in the given equation. This could aid readers in comprehending scientific documents containing mathematics unfamiliar to them.

Effect on topic modeling.

In Table 10, we compare our baseline topic model (top) and this TopicEq variant with the alignment component (bottom). The joint alignment model provides moderate improvements for topic modeling quality.

Conclusion

Motivated by the topical correspondence between text and mathematical equations observed in scientific documents, we proposed TopicEq, a joint topic-equation model that generates the text by a topic model and the equations by a topic-dependent RNN. This joint model outperforms existing topic models and equation models for scientific texts. We also qualitatively analyzed TopicEq, and showed its applications and extensions, such as equation topic inference and topic-aware alignment of mathematical symbols and words.

Acknowledgments

We thank Matt Bonakdarpour, Paul Ginsparg, Samuel Helms, and Kriste Krstovski for their assistance, and Jungo Kasai as well as the anonymous reviewers for their feedback. This work was supported in part by a grant from the Alfred P. Sloan Foundation.

References

  • [Ahn et al.2016] Ahn, S.; Choi, H.; Pärnamaa, T.; and Bengio, Y. 2016. A neural knowledge language model. arXiv:1608.00318.
  • [Blei and Jordan2003] Blei, D. M., and Jordan, M. I. 2003. Modeling annotated data. In SIGIR.
  • [Blei and Lafferty2006] Blei, D. M., and Lafferty, J. D. 2006. Dynamic topic models. In ICML.
  • [Blei and Lafferty2007] Blei, D. M., and Lafferty, J. D. 2007. A correlated topic model of science. The Annals of Applied Statistics 17–35.
  • [Blei, Ng, and Jordan2003] Blei, D. M.; Ng, A. Y.; and Jordan, M. I. 2003. Latent dirichlet allocation. JMLR.
  • [Cao et al.2015] Cao, Z.; Li, S.; Liu, Y.; Li, W.; and Ji, H. 2015. A novel neural topic model and its supervised extension. In AAAI.
  • [Chang et al.2009] Chang, J.; Gerrish, S.; Wang, C.; Boyd-Graber, J. L.; and Blei, D. M. 2009. Reading tea leaves: How humans interpret topic models. In NIPS.
  • [Deng et al.2017] Deng, Y.; Kanervisto, A.; Ling, J.; and Rush, A. M. 2017. Image-to-markup generation with coarse-to-fine attention. In ICML.
  • [Dieng et al.2017] Dieng, A. B.; Wang, C.; Gao, J.; and Paisley, J. 2017. Topicrnn: A recurrent neural network with long-range semantic dependency. In ICLR.
  • [Hall, Jurafsky, and Manning2008] Hall, D.; Jurafsky, D.; and Manning, C. D. 2008. Studying the history of ideas using topic models. In EMNLP.
  • [Hochreiter and Schmidhuber1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.
  • [Jordan et al.1999] Jordan, M. I.; Ghahramani, Z.; Jaakkola, T. S.; and Saul, L. K. 1999. An introduction to variational methods for graphical models. Machine learning 37(2):183–233.
  • [Jozefowicz et al.2016] Jozefowicz, R.; Vinyals, O.; Schuster, M.; Shazeer, N.; and Wu, Y. 2016. Exploring the limits of language modeling. arXiv:1602.02410.
  • [Karpathy2015] Karpathy, A. 2015. The unreasonable effectiveness of recurrent neural networks. Blog posting, May 21.
  • [Kingma and Ba2015] Kingma, D., and Ba, J. 2015. Adam: A method for stochastic optimization. In ICLR.
  • [Kingma and Welling2014] Kingma, D. P., and Welling, M. 2014. Auto-encoding variational bayes. In ICLR.
  • [Krstovski and Blei2018] Krstovski, K., and Blei, D. M. 2018. Equation embeddings. arXiv:1803.09123.
  • [Lan et al.2015] Lan, A. S.; Vats, D.; Waters, A. E.; and Baraniuk, R. G. 2015. Mathematical language processing: Automatic grading and feedback for open response mathematical questions. In ACM Conference on Learning@ Scale.
  • [Larochelle and Lauly2012] Larochelle, H., and Lauly, S. 2012. A neural autoregressive topic model. In NIPS.
  • [Lau, Baldwin, and Cohn2017] Lau, J. H.; Baldwin, T.; and Cohn, T. 2017. Topically driven neural language model. In ACL.
  • [Lau, Newman, and Baldwin2014] Lau, J. H.; Newman, D.; and Baldwin, T. 2014. Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality. In EACL.
  • [Miao, Grefenstette, and Blunsom2017] Miao, Y.; Grefenstette, E.; and Blunsom, P. 2017. Discovering discrete latent topics with neural variational inference. In ICML.
  • [Miao, Yu, and Blunsom2016] Miao, Y.; Yu, L.; and Blunsom, P. 2016. Neural variational inference for text processing. In ICML.
  • [Mikolov and Zweig2012] Mikolov, T., and Zweig, G. 2012. Context dependent recurrent neural network language model. SLT 12:234–239.
  • [Mikolov et al.2010] Mikolov, T.; Karafiát, M.; Burget, L.; Černockỳ, J.; and Khudanpur, S. 2010. Recurrent neural network based language model. In Interspeech.
  • [Mimno et al.2011] Mimno, D.; Wallach, H. M.; Talley, E.; Leenders, M.; and McCallum, A. 2011. Optimizing semantic coherence in topic models. In EMNLP.
  • [Newman et al.2010a] Newman, D.; Baldwin, T.; Cavedon, L.; Huang, E.; Karimi, S.; Martinez, D.; Scholer, F.; and Zobel, J. 2010a. Visualizing search results and document collections using topic maps. Web Semantics: Science, Services and Agents on the World Wide Web 8(2-3):169–175.
  • [Newman et al.2010b] Newman, D.; Lau, J. H.; Grieser, K.; and Baldwin, T. 2010b. Automatic evaluation of topic coherence. In NAACL.
  • [Pagael and Schubotz2014] Pagael, R., and Schubotz, M. 2014. Mathematical language processing project. In CICM.
  • [Pascanu, Mikolov, and Bengio2012] Pascanu, R.; Mikolov, T.; and Bengio, Y. 2012. On the difficulty of training recurrent neural networks. arXiv:1211.5063.
  • [Roy, Upadhyay, and Roth2016] Roy, S.; Upadhyay, S.; and Roth, D. 2016. Equation parsing: Mapping sentences to grounded equations. In EMNLP.
  • [Schubotz et al.2016] Schubotz, M.; Grigorev, A.; Leich, M.; Cohl, H. S.; Meuschke, N.; Gipp, B.; Youssef, A. S.; and Markl, V. 2016. Semantification of identifiers in mathematics for better math information retrieval. In SIGIR.
  • [Sojka and Líška2011] Sojka, P., and Líška, M. 2011. Indexing and searching mathematics in digital libraries. In CICM.
  • [Song, Gan, and Carin2016] Song, J.; Gan, Z.; and Carin, L. 2016. Factored temporal sigmoid belief networks for sequence learning. In ICML.
  • [Srivastava and Sutton2017] Srivastava, A., and Sutton, C. 2017. Autoencoding variational inference for topic models. In ICLR.
  • [Srivastava et al.2014] Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: A simple way to prevent neural networks from overfitting. JMLR.
  • [Stathopoulos et al.2018] Stathopoulos, Y.; Baker, S.; Rei, M.; and Teufel, S. 2018. Variable typing: Assigning meaning to variables in mathematical text. In NAACL.
  • [Teh et al.2005] Teh, Y. W.; Jordan, M. I.; Beal, M. J.; and Blei, D. M. 2005. Sharing clusters among related groups: Hierarchical dirichlet processes. In NIPS.
  • [Wang et al.2015] Wang, Y.; Gao, L.; Wang, S.; Tang, Z.; Liu, X.; and Yuan, K. 2015. Wikimirs 3.0: a hybrid mir system based on the context, structure and importance of formulae in a document. In JCDL.
  • [Wang et al.2018] Wang, W.; Gan, Z.; Wang, W.; Shen, D.; Huang, J.; Ping, W.; Satheesh, S.; and Carin, L. 2018. Topic compositional neural language model. In AISTATS.
  • [Xie, Deng, and Xing2015] Xie, P.; Deng, Y.; and Xing, E. 2015. Diversifying restricted boltzmann machine for document modeling. In KDD.
  • [Zanibbi et al.2016] Zanibbi, R.; Davila, K.; Kane, A.; and Tompa, F. W. 2016. Multi-stage math formula search: Using appearance-based similarity metrics at scale. In SIGIR.