Solving math word problems has been an interest of the natural language processing community since the 1960s Feigenbaum et al. (1963); Bobrow (1964). More recently, algorithms for learning to solve algebra problems have gone in complementary directions: semantic and purely data-driven.
Semantic methods learn from data how to map problem texts to a semantic representation which can then be converted to an equation. These representations combine set-like constructs Hosseini et al. (2014) with hierarchical representations like equation trees Koncel-Kedziorski et al. (2015); Roy and Roth (2015); Wang et al. (2018). Such methods have the benefit of being interpretable, but no semantic representation general enough to solve all varieties of math word problems, including proportion problems and those that map to systems of equations, has been found.
Another popular line of research is on purely data-driven solvers. Given enough training data, data-driven models can learn to map word problem texts to arbitrarily complex equations or systems of equations. These models have the additional advantage of being more language-independent than semantic methods, which often rely on parsers and other NLP tools. To train these fully data-driven models, large-scale datasets for both English and Chinese were recently introduced Wang et al. (2017); Koncel-Kedziorski et al. (2016).
In response to the success of representation learning elsewhere in NLP, sequence to sequence (seq2seq) models have been applied to algebra problem solving Wang et al. (2017). These powerful models have been shown to outperform other data-driven approaches in a variety of tasks. However, it is not obvious that solving word problems is best modeled as a sequence prediction task rather than a classification or retrieval task. Downstream applications such as question answering or automated tutoring systems may never have to deal with arbitrarily complex or even unseen equation types, obviating the need for a sequence prediction model.
These considerations raise the following questions: How do data-driven approaches to math word problem solving compare to each other? How can data-driven approaches benefit from recent advances in neural representation learning? What are the limits of data-driven solvers?
In this paper, we thoroughly examine data-driven techniques on three larger algebra word problem datasets Huang et al. (2016); Koncel-Kedziorski et al. (2016); Wang et al. (2017). We study classification, generation, and information retrieval models, and examine popular extensions to these models such as structured self-attention Lin et al. (2017) and the use of pretrained word embeddings Pennington et al. (2014); Peters et al. (2018).
Our experiments show that a well-tuned neural equation classifier consistently performs better than more sophisticated solvers. We provide evidence that pretrained word embeddings, useful in other tasks, are not helpful for word problem solving. Advanced modeling such as structured self-attention is not shown to improve performance versus a well-tuned BiLSTM Classifier. Our error analysis supports the idea that, while data-driven techniques are powerful and robust, many word problems require semantic or world knowledge that cannot be easily incorporated into an end-to-end learning framework.
2 Problem Formulation
Solving an algebra word problem (as shown below) requires finding the correct solution given the text of the problem.
|Word problem||Aliyah had some candy to give to her 3 children. She first took 2 pieces for herself and then evenly divided the rest among her children. Each child received 5 pieces. With how many pieces did she start?|
|Equation||2 + (3 * 5) = x|
|Equation template||B + (A * C) = x|
Similar to previous data-driven methods, we frame the task as one of mapping the word problem texts to equations given the training data. Our models abstract the specific numbers away from both the word problem text and target equation, preserving the ordering of the numbers found in the problem text. The resulting abstracted equation is called an equation template. At inference time, our solvers produce an equation template given the test problem. The template is then populated with the actual numbers from the problem text and evaluated to produce a solution.
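The abstraction and population steps can be illustrated with a minimal Python sketch. The helper names (`abstract_numbers`, `abstract_equation`, `solve`) are hypothetical and not part of any released code. The sketch assumes that numbers are written as digits, that slots are named A, B, C, ... in order of appearance, and that the unknown is isolated on the right-hand side of the template.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")

def abstract_numbers(text):
    """Replace each number in the text with a slot (A, B, C, ...) in order of appearance."""
    values = {}
    def to_slot(match):
        slot = chr(ord("A") + len(values))  # assumes at most 26 numbers per problem
        values[slot] = float(match.group())
        return slot
    return NUMBER.sub(to_slot, text), values

def abstract_equation(equation, values):
    """Rewrite the numbers of an equation as slots; repeated values map to their first slot."""
    slot_of = {}
    for slot, value in values.items():
        slot_of.setdefault(value, slot)
    return NUMBER.sub(lambda m: slot_of[float(m.group())], equation)

def solve(template, values):
    """Populate a template of the form '<expr> = x' with numbers and evaluate it."""
    lhs, rhs = (side.strip() for side in template.split("="))
    if rhs != "x":
        raise ValueError("this sketch only handles templates of the form <expr> = x")
    expr = re.sub(r"[A-Z]", lambda m: repr(values[m.group()]), lhs)
    if not re.fullmatch(r"[\d\.\s\+\-\*/\(\)]+", expr):
        raise ValueError("unexpected symbol in populated template")
    return eval(expr, {"__builtins__": {}})  # arithmetic only, checked above

problem = ("Aliyah had some candy to give to her 3 children. She first took 2 pieces "
           "for herself and then evenly divided the rest among her children. "
           "Each child received 5 pieces. With how many pieces did she start?")
abstracted_text, values = abstract_numbers(problem)
template = abstract_equation("2 + (3 * 5) = x", values)
print(template)                 # B + (A * C) = x
print(solve(template, values))  # 17.0
```

Because slots follow the order of numbers in the text, the 3 children become A, the 2 pieces become B, and the 5 pieces become C, which reproduces the template in the example above.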
Retrieval methods map test word problem texts at inference time to the nearest training problem according to some similarity metric. The nearest neighbor’s equation template is then filled in with numbers from the test problem and solved. Following Wang et al. (2017), we use Jaccard distance in this model. For test problem $S$ and training problem $T$, each treated as a set of words, the Jaccard similarity is computed as:
$$\mathrm{jacc}(S, T) = \frac{|S \cap T|}{|S \cup T|}$$
We also evaluate the use of a cosine similarity metric. Words from $S$ and $T$ are associated with pretrained vectors Pennington et al. (2014). These vectors are averaged across each problem, resulting in vectors $V_S$ and $V_T$. The cosine similarity is then computed as $\cos(S, T) = \frac{V_S \cdot V_T}{\lVert V_S \rVert \, \lVert V_T \rVert}$. Vector averaging has previously been used as a strong baseline for a variety of sentence similarity tasks Mu et al. (2017).
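A minimal sketch of nearest-neighbor retrieval under both similarity measures follows. It uses only the Python standard library; the function names, the whitespace-token representation of problems, and the `(tokens, template)` layout of the training data are assumptions made for illustration, and the word vectors would in practice come from a pretrained embedding table.

```python
import math

def jaccard(s_tokens, t_tokens):
    """Jaccard similarity between the word sets of two problems."""
    s, t = set(s_tokens), set(t_tokens)
    union = s | t
    return len(s & t) / len(union) if union else 0.0

def average_vector(tokens, vectors, dim):
    """Average the vectors of all tokens that have an entry in the embedding table."""
    known = [vectors[w] for w in tokens if w in vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]

def cosine(u, v):
    """Cosine similarity between two dense vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve_template(test_tokens, training, similarity):
    """Return the template of the most similar training problem.

    `training` is a list of (tokens, template) pairs (an assumed layout).
    """
    best = max(training, key=lambda example: similarity(test_tokens, example[0]))
    return best[1]
```

For the cosine variant, `similarity` can be a small wrapper that averages the vectors of each token list and passes the two averages to `cosine`.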
Classification methods learn to map problem texts to equation templates by learning parameters that minimize a cross entropy loss function over the set of training instances. At inference time, these methods choose the most likely equation template (the class) given a test word problem text. In both retrieval and classification methods, model accuracy is upper bounded by the oracle accuracy, that is, the fraction of test problems whose equation templates appear in the training data.
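The oracle bound can be computed directly from the template sets. The sketch below assumes templates are compared as exact strings; the function name is hypothetical.

```python
def oracle_accuracy(train_templates, test_templates):
    """Fraction of test problems whose template also occurs in the training data."""
    seen = set(train_templates)
    hits = sum(template in seen for template in test_templates)
    return hits / len(test_templates)
```

A retrieval or classification model can never exceed this value, since it can only output templates it has seen during training.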
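The BiLSTM classification model encodes the word problem text using a bidirectional Long Short Term Memory network Hochreiter and Schmidhuber (1997) with learned parameters $\theta$. The final hidden state of this encoding is scaled to the number of classes by weights $W$ and passed through a softmax to produce a distribution over class labels. The probability of equation template $j$ for problem $S$ is given by:
$$p(j \mid S; \theta, W) = \frac{\exp(w_j \cdot h_S)}{\sum_{k} \exp(w_k \cdot h_S)}$$
where $h_S$ is the final BiLSTM hidden state for $S$ and $w_k$ is the row of $W$ corresponding to template $k$.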
This model is trained end-to-end using cross entropy loss.
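The output layer and loss can be sketched in plain Python, leaving the BiLSTM encoder itself aside. The function names are hypothetical, and the hidden state and weight rows are assumed to be plain lists of floats.

```python
import math

def class_scores(hidden, weights):
    """Scale the final hidden state to one score per equation template."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weights]

def softmax(scores):
    """Turn scores into a probability distribution (max-shifted for numerical stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, gold):
    """Negative log-probability of the gold template index."""
    return -math.log(probs[gold])
```

At inference time the predicted template is the index with the highest probability; during training the loss is averaged over problems and minimized with respect to both the encoder parameters and the output weights.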
Sentence embeddings using self-attention mechanisms Lin et al. (2017) were shown to be successful in question answering tasks Liu et al. (2017). We conjecture that algebra problem solvers can also benefit from the long-distance dependency information introduced by self-attention. Here, bidirectional LSTM encoders capture relationships among the words of the input text. A multi-hop self-attention mechanism is applied to the resulting hidden states to produce a fixed-size embedding. The different attention hops are constrained so as to reduce redundancy, ensuring that various semantic aspects of the input are included in the resulting embedding. We refer the reader to the original paper for details.
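For orientation, the core equations of the structured self-attentive embedding can be restated as follows. This is a summary in the notation of Lin et al. (2017), not a description of any modification made here: $H \in \mathbb{R}^{n \times 2u}$ stacks the BiLSTM hidden states of the $n$ input words, $W_{s1} \in \mathbb{R}^{d_a \times 2u}$ and $W_{s2} \in \mathbb{R}^{r \times d_a}$ are learned weights, $r$ is the number of attention hops, and the softmax is taken over the $n$ word positions.

```latex
\begin{align}
A &= \mathrm{softmax}\!\left(W_{s2}\,\tanh\!\left(W_{s1} H^{\top}\right)\right) \\
M &= A H \\
P &= \left\lVert A A^{\top} - I \right\rVert_F^2
\end{align}
```

Each row of $A$ is one attention hop, $M$ is the resulting fixed-size embedding, and the penalty $P$ is added to the loss to discourage different hops from attending to the same words.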
Generation methods treat equation templates as strings of formal symbols. The production of a template is considered a sequence prediction problem conditioned on the word problem text. By treating templates as sequences rather than monolithic structures, generation methods have the potential to learn finer-grained relationships between the input text and output template. They are also the only methods studied here that can produce templates at inference time that were not seen during training.
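To make the sequence view concrete, the following sketch splits a template into the symbol sequence a decoder would be trained to emit and checks whether a generated sequence corresponds to a template absent from the training data. The tokenization rule and function names are assumptions for illustration only.

```python
import re

# Slots and unknowns, literal numbers, and single-character operators or brackets.
TOKEN = re.compile(r"[A-Za-z]+|\d+(?:\.\d+)?|[-+*/=()]")

def template_tokens(template):
    """Split an equation template into a sequence of formal symbols."""
    return TOKEN.findall(template)

def is_unseen(generated_tokens, training_templates):
    """True if the generated symbol sequence matches no training template."""
    seen = {tuple(template_tokens(t)) for t in training_templates}
    return tuple(generated_tokens) not in seen

print(template_tokens("B + (A * C) = x"))
```

A classifier or retrieval model can only ever return a member of the training template set, whereas a decoder emitting symbols one at a time may produce a sequence for which `is_unseen` returns `True`.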
We generate equation templates with seq2seq models Sutskever et al. (2014) with attention mechanisms Luong et al. (2015). These models condition the token-by-token generation of the equation template on encodings of the word problem text. Following Wang et al. (2017), we evaluate a seq2seq model with LSTMs as the encoder and decoder. We also evaluate the use of Convolutional Neural Networks (CNNs) in the encoder and decoder.
4.1 Experimental Setup
|Dataset||# Quest.||# Templates||# Sent.|
For comparison, we report solution accuracy on the Chinese language Math23K dataset Wang et al. (2017), and the English language Draw Upadhyay and Chang (2015) and Mawps Koncel-Kedziorski et al. (2016) datasets. Math23K and Mawps consist of single equation problems, and Draw contains both single and simultaneous equation problems. Details on the datasets are shown in Table 1.
The Math23K dataset contains problems with possibly irrelevant quantities. To prune these quantities, we implement a significant number identifier (SNI) as discussed in Wang et al. (2017). Our best accuracy for SNI is 97%, slightly weaker than previous results.
Our BiLSTM model’s parameters are tuned on a validation set for each dataset. We also explore two modifications of the BiLSTM’s embedding matrix: using pretrained GloVe embeddings Pennington et al. (2014), or using the ELMo technique of Peters et al. (2018), as implemented in the AllenNLP toolkit Gardner et al., with pretrained character embeddings. For seq2seq modeling, we use OpenNMT Klein et al. (2017) with 500-dimensional hidden states and embeddings and a dropout rate of 0.3. The CNN uses a kernel width of 3. Optimization is done using SGD with a learning rate of 1, decayed by half if the validation perplexity does not decrease after an epoch.
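The learning rate schedule can be sketched as below. This is an illustration of the rule as stated, not OpenNMT's implementation: it assumes that each epoch's validation perplexity is compared with that of the previous epoch, and the function name is hypothetical.

```python
def decayed_learning_rates(val_perplexities, initial_lr=1.0, decay=0.5):
    """Return the learning rate in effect after each epoch's validation step.

    The rate is multiplied by `decay` whenever the validation perplexity fails
    to decrease relative to the previous epoch (an assumed comparison).
    """
    lr = initial_lr
    previous = None
    rates = []
    for perplexity in val_perplexities:
        if previous is not None and perplexity >= previous:
            lr *= decay
        rates.append(lr)
        previous = perplexity
    return rates

print(decayed_learning_rates([10.0, 8.0, 8.5, 7.0, 7.0]))  # [1.0, 1.0, 0.5, 0.5, 0.25]
```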
|State of the art||52.0||–||64.7|
|Semantic Limitations (36%)||Kendra made punch for her friend’s birthday party. She used 3/4 of a gallon of grape juice, 1/4 of a gallon of cranberry juice, and 3/5 of a gallon of club soda. How many gallons of punch did Kendra make?|
|||Sandy went to the mall to buy clothes. She spent $20 on shorts, $10 on a shirt, and $35 on a jacket. How much money did Sandy spend on clothes?|
|World Knowledge (19%)||Mary began walking home from school, heading south at a rate of 3 miles per hour. Sharon left school at the same time heading north at 5 miles per hour. How long will it take for them to be 20 miles apart?|
|||If you purchase a membership for 100 dollars to receive 5% off purchases, how much would you need to spend to pay off the membership?|
Table 2 reports the accuracies of the data-driven models for solving algebra word problems. The classification models perform better than retrieval or generation models, despite their limited modeling power. The self-attention classification model performs well across all datasets. For the largest dataset (Math23K), a simple, well-tuned classifier can outperform the more sophisticated sequence-to-sequence and self-attention models.
Table 3 shows the results of augmenting the classifier with pretrained word and character embeddings. Neither of these methods helps on the English-language data. It appears that the ELMo technique may require more training examples before it can improve solution accuracy.
The previous state of the art model for the Draw dataset is described in Upadhyay and Chang (2015). The state of the art for Math23K, described in Wang et al. (2017), uses a hybrid Jaccard retrieval and seq2seq model. All models shown here fall well short of the highest possible classification/retrieval accuracy, shown in Table 2 as “Oracle”. This gap invites a more detailed error analysis regarding the possible limitations of data-driven solvers.
4.3 Error Analysis
Despite the sophistication of these data-driven models, they still do not achieve optimal performance. A closer analysis of the errors these models make can illuminate the reason for this gap.
Consider Table 4, which illustrates two classes of errors made by data-driven systems. Both stem from incomplete knowledge on the part of the learning algorithm. But it is worth distinguishing the “semantic limitations” errors as this kind of information (subset relations, counts of non-numerical entities) may be possible to extract from the data provided, given a sufficiently powerful modeling technique.
The second class of errors, labeled “world knowledge”, requires information that is impossible to extract from the math data alone. Consider the first example of people walking in different directions. To solve this problem, it is necessary to know that “north” and “south” point away from each other. To complicate the problem, suppose Sharon walked east instead of north. Then the relationship between east and south would affect the problem semantics. This kind of knowledge is beyond what is conveyed in any dataset of math word problems, and its absence is a known problem for many NLP applications.
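The effect of the direction words on the equation can be worked out explicitly. In the sketch below, $t$ is the elapsed time in hours; the perpendicular case assumes Mary still walks south at 3 miles per hour while Sharon walks east at 5 miles per hour.

```latex
\begin{align}
\text{south and north (opposite):} \quad & 3t + 5t = 20
  \;\Rightarrow\; t = 2.5 \text{ hours} \\
\text{south and east (perpendicular):} \quad & \sqrt{(3t)^2 + (5t)^2} = 20
  \;\Rightarrow\; t = \frac{20}{\sqrt{34}} \approx 3.43 \text{ hours}
\end{align}
```

The numbers in the text are identical in both cases, so a solver that sees only abstracted quantities has no signal for choosing between the two templates.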
5 Related Work
Semantic solvers provide some scaffolding for the grounding of word problem texts to equations. Mitra and Baral (2015) solve simple word problems by categorizing their operations as part-whole, change, or comparison. Shi et al. (2015) learn a semantic parser by semi-automatically inducing 9600 grammar rules over a dataset of number word problems. Works such as Roy and Roth (2015) and Koncel-Kedziorski et al. (2015) treat arithmetic word problem templates as equation trees and perform efficient tree search by learning how to combine quantities using textual information. Roy and Roth (2017) advance this approach by considering unit consistency in the tree-search procedure. Wang et al. (2018) advance this line of work even further by modeling the search using deep Q-learning. Still, these semantic approaches are limited by their inability to model systems of equations and by their reliance on hand-engineered features.
Data-driven math word problem solvers include Kushman et al. (2014), who learn to predict equation templates and subsequently align numbers and unknowns from the text. Zhou et al. (2015) only assign numbers to the predicted template, reducing the search space significantly. More recently, Wang et al. (2017) provide a large dataset of Chinese algebra word problems and learn a hybrid model consisting of both retrieval and seq2seq components. The current work extends these approaches by exploring advanced techniques in data-driven solving.
We have thoroughly examined data-driven models for automatically solving algebra word problems, including retrieval, classification, and generation techniques. We find that a well-tuned classifier outperforms generation and retrieval on several datasets. One avenue for improving performance is to ensemble different models. However, in light of the error analysis provided, the incorporation of semantic and world knowledge will be necessary to achieve maximal success.
- Bobrow (1964) Daniel G Bobrow. 1964. Natural language input for a computer problem solving system.
- Feigenbaum et al. (1963) Edward A Feigenbaum, Julian Feldman, et al. 1963. Computers and thought. New York.
- Gardner et al. Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer. Allennlp: A deep semantic natural language processing platform.
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
- Hosseini et al. (2014) Mohammad Javad Hosseini, Hannaneh Hajishirzi, Oren Etzioni, and Nate Kushman. 2014. Learning to Solve Arithmetic Word Problems with Verb Categorization. In EMNLP, pages 523–533.
- Huang et al. (2016) Danqing Huang, Shuming Shi, Chin-Yew Lin, Jian Yin, and Wei-Ying Ma. 2016. How well do computers solve math word problems? large-scale dataset construction and evaluation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 887–896.
- Klein et al. (2017) Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M Rush. 2017. Opennmt: Open-source toolkit for neural machine translation. arXiv preprint arXiv:1701.02810.
- Koncel-Kedziorski et al. (2015) Rik Koncel-Kedziorski, Hannaneh Hajishirzi, Ashish Sabharwal, Oren Etzioni, and Siena Ang. 2015. Parsing Algebraic Word Problems into Equations. TACL, 3:585–597.
- Koncel-Kedziorski et al. (2016) Rik Koncel-Kedziorski, Subhro Roy, Aida Amini, Nate Kushman, and Hannaneh Hajishirzi. 2016. Mawps: A math word problem repository. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1152–1157.
- Kushman et al. (2014) Nate Kushman, Yoav Artzi, Luke Zettlemoyer, and Regina Barzilay. 2014. Learning to automatically solve algebra word problems. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 271–281.
- Lin et al. (2017) Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130.
- Liu et al. (2017) Xiaodong Liu, Yelong Shen, Kevin Duh, and Jianfeng Gao. 2017. Stochastic answer networks for machine reading comprehension. arXiv preprint arXiv:1712.03556.
- Luong et al. (2015) Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.
- Mitra and Baral (2015) Arindam Mitra and Chitta Baral. 2015. Learning to automatically solve logic grid puzzles. In EMNLP.
- Mu et al. (2017) Jiaqi Mu, Suma Bhat, and Pramod Viswanath. 2017. Representing sentences as low-rank subspaces. ACL.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543.
- Peters et al. (2018) Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
- Roy and Roth (2015) Subhro Roy and Dan Roth. 2015. Solving General Arithmetic Word Problems. In EMNLP.
- Roy and Roth (2017) Subhro Roy and Dan Roth. 2017. Unit dependency graph and its application to arithmetic word problem solving. AAAI.
- Shi et al. (2015) Shuming Shi, Yuehui Wang, Chin-Yew Lin, Xiaojiang Liu, and Yong Rui. 2015. Automatically Solving Number Word Problems by Semantic Parsing and Reasoning. In EMNLP.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112.
- Upadhyay and Chang (2015) Shyam Upadhyay and Ming-Wei Chang. 2015. Draw: A challenging and diverse algebra word problem set. Technical report, Number MSR-TR-2015-78, Oct. 2015. [Online]. Available: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tech_rep.pdf.
- Wang et al. (2018) Lei Wang, Dongxiang Zhang, Lianli Gao, Jingkuan Song, Long Guo, and Heng Tao Shen. 2018. Mathdqn: Solving arithmetic word problems via deep reinforcement learning.
- Wang et al. (2017) Yan Wang, Xiaojiang Liu, and Shuming Shi. 2017. Deep neural solver for math word problems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 845–854.
- Zhou et al. (2015) Lipu Zhou, Shuaixiang Dai, and Liwei Chen. 2015. Learn to Solve Algebra Word Problems Using Quadratic Programming. In EMNLP.