Developing computer systems to automatically solve math word problems (MWPs) has interested NLP researchers since 1963 Feigenbaum et al. (1963); Bobrow (1964). A typical MWP is shown in Table 1. Readers are asked to infer how many pens and pencils Jessica has in total, based on the textual problem description. Statistical machine learning-based methods Kushman et al. (2014); Amnueypornsakul and Bhat (2014); Zhou et al. (2015); Mitra and Baral (2016); Roy and Roth (2018) and semantic parsing-based methods Shi et al. (2015); Koncel-Kedziorski et al. (2015); Roy and Roth (2015); Huang et al. (2017) have been proposed to tackle this problem, yet they still require considerable manual effort to design features or templates. For more literature on solving math word problems automatically, refer to a recent survey Zhang et al. (2018).
Recently, Deep Neural Networks (DNNs) have opened a new direction towards automatic MWP solving. Ling et al. (2017) take multiple-choice problems as input and automatically generate rationale text and the final choice. Wang et al. (2018) make the first attempt at applying deep reinforcement learning to arithmetic word problem solving. Wang et al. (2017) train a deep neural solver (DNS) that needs no hand-crafted features, using a Seq2Seq model to automatically learn the problem-to-equation mapping.
|Problem: Dan has 5 pens and 3 pencils, Jessica has 4 more pens and 2 fewer pencils than him. How many pens and pencils does Jessica have in total?|
|Equation: x = (5+4)+(3-2); Solution: 10|
Although promising results have been reported, the model in (Wang et al., 2017) still suffers from an equation duplication problem: an MWP can be solved by multiple equivalent equations. Taking the problem in Table 1 as an example, it can be solved by various equations such as 5+4+3-2, 3-2+5+4 and (5+4)+(3-2). This duplication results in a non-deterministic output space, which has a negative impact on the performance of most data-driven methods. In this paper, by considering the uniqueness of the expression tree, we propose an equation normalization method to solve this problem.
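As a quick illustration of the duplication problem, several superficially different equations all solve the Table 1 problem (a sanity check, not part of the proposed method):

```python
# Three superficially different equations for the Table 1 problem
# (Dan: 5 pens, 3 pencils; Jessica: 4 more pens, 2 fewer pencils)
# all evaluate to the same solution, 10.
for eq in ("5+4+3-2", "3-2+5+4", "(5+4)+(3-2)"):
    assert eval(eq) == 10
```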
Given the success of different Seq2Seq models on machine translation (such as the recurrent encoder-decoder Wu et al. (2016), the Convolutional Seq2Seq model Gehring et al. (2017), and the Transformer Vaswani et al. (2017)), it is promising to adapt them to MWP solving. In this paper, we compare the performance of three state-of-the-art Seq2Seq models on MWP solving. We observe that different models correctly solve different MWPs; we therefore propose an ensemble model to achieve higher performance. Experiments on the Math23K dataset show that, by adopting the equation normalization and model ensemble techniques, accuracy is boosted from 60.7% to 68.4%.
The remainder of this paper is organized as follows: we first introduce the Seq2Seq framework in Section 2. Then the equation normalization process is presented in Section 3, after which three Seq2Seq models and an ensemble model are applied to MWP solving in Section 4. The experimental results are presented in Section 5. Finally, we conclude this paper in Section 6.
2 Seq2Seq Framework
The process of using a Seq2Seq model to solve MWPs can be divided into two stages Wang et al. (2017). In the first stage (the number mapping stage), significant numbers (numbers that will be used in the actual calculation) in a problem are mapped to a list of number tokens n1, n2, ... by their natural order in the problem text. Throughout this paper, we use the significant number identification (SNI) module proposed in Wang et al. (2017) to identify whether a number is significant. In the second stage, Seq2Seq models are trained by taking the problem text as the source sequence and equation templates (equations after number mapping) as the target sequence.
Taking the problem in Table 1 as an example, we first obtain the number mapping {n1=5, n2=3, n3=4, n4=2}, and transform the given equation x = (5+4)+(3-2) into the equation template x = (n1+n3)+(n2-n4). During training, the objective of our Seq2Seq model is to maximize the conditional probability $P(Y \mid X)$ of the target template $Y$ given the problem text $X$, which is decomposed into token-wise probabilities. During decoding, we use beam search to approximate the most likely equation template. After that, we replace the number tokens with the actual numbers and calculate the solution with a math solver.
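A minimal sketch of the number mapping stage. The helper below is hypothetical: the real system relies on the SNI module to decide which numbers are significant, whereas here every number in the text is naively treated as significant:

```python
import re

def number_map(problem: str):
    """Replace each number in the problem text with a token n1, n2, ...
    in order of appearance, and return the resulting mapping."""
    mapping = {}

    def repl(match):
        token = f"n{len(mapping) + 1}"
        mapping[token] = float(match.group())
        return token

    # Toy stand-in for SNI: every number is treated as significant.
    templated = re.sub(r"\d+(?:\.\d+)?", repl, problem)
    return templated, mapping

text, mapping = number_map(
    "Dan has 5 pens and 3 pencils, Jessica has 4 more pens "
    "and 2 fewer pencils than him."
)
print(mapping)  # {'n1': 5.0, 'n2': 3.0, 'n3': 4.0, 'n4': 2.0}
```

After decoding, the inverse substitution (number tokens back to `mapping` values) yields an executable equation.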
3 Equation Normalization
In the number mapping stage, the equations are transformed into equation templates. However, due to the equation duplication problem introduced in Section 1, this problem-to-template formalization is a non-deterministic transduction that has adverse effects on the performance of maximum likelihood estimation. There are two types of equation duplication: 1) order duplication, such as "n1+n2" and "n2+n1"; 2) bracket duplication, such as "n1+n2+n3" and "n1+(n2+n3)".
To normalize the order-duplicated templates, we define two normalization rules:
Rule 1: Two duplicated equation templates with unequal length should be normalized to the shorter one. For example, the two equation templates "n1+n2+n3-n3" and "n1+n2" should be normalized to the latter.
Rule 2: The number tokens in equation templates should be ordered as closely as possible to their order in the number mapping. For example, the three equation templates "n2+n1+n3", "n3+n1+n2" and "n1+n2+n3" should be normalized to "n1+n2+n3".
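Rule 2 can be sketched for the special case of purely additive templates (a hypothetical simplification; the full method must also respect non-commutative operators such as subtraction and division):

```python
def normalize_order(template: str) -> str:
    """Reorder the operands of a purely additive template by token
    index, so that n1 precedes n2, etc. (a special case of Rule 2)."""
    terms = template.split("+")
    # "n3" -> 3, "n12" -> 12: sort by the numeric part of each token.
    return "+".join(sorted(terms, key=lambda t: int(t[1:])))

print(normalize_order("n3+n2+n1"))  # n1+n2+n3
print(normalize_order("n2+n1+n3"))  # n1+n2+n3
```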
To solve the bracket duplication problem, we further normalize the equation templates into expression trees. Every inner node of an expression tree is an operator with two children, while each leaf node is a number token. An example of expressing an equation template as its unique expression tree is shown in Figure 1.
After equation normalization, the Seq2Seq models solve MWPs by taking the problem text as the source sequence and the postorder traversal of the unique expression tree as the target sequence, as shown in Figure 2.
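As an illustrative sketch (not the paper's implementation), the postorder target sequence can be obtained from a tokenized infix template with the standard shunting-yard algorithm, whose postfix output equals the postorder traversal of the expression tree; note that no brackets remain in the output:

```python
def to_postorder(tokens):
    """Convert an infix equation template (list of tokens) into the
    postorder traversal of its expression tree (shunting-yard)."""
    prec = {"+": 1, "-": 1, "*": 2, "/": 2}
    out, ops = [], []
    for tok in tokens:
        if tok in prec:
            # Pop operators of greater or equal precedence first
            # (all four operators are treated as left-associative).
            while ops and ops[-1] != "(" and prec[ops[-1]] >= prec[tok]:
                out.append(ops.pop())
            ops.append(tok)
        elif tok == "(":
            ops.append(tok)
        elif tok == ")":
            while ops[-1] != "(":
                out.append(ops.pop())
            ops.pop()  # discard the matching "("
        else:  # number token
            out.append(tok)
    while ops:
        out.append(ops.pop())
    return out

template = ["(", "n1", "+", "n3", ")", "+", "(", "n2", "-", "n4", ")"]
print(to_postorder(template))
# ['n1', 'n3', '+', 'n2', 'n4', '-', '+']
```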
4 Models
In this section, we present three types of Seq2Seq models to solve MWPs: the bidirectional Long Short-Term Memory network (Bi-LSTM) Wu et al. (2016), the Convolutional Seq2Seq model Gehring et al. (2017), and the Transformer Vaswani et al. (2017). To benefit from the strengths of all three architectures, we propose a simple ensemble method.
4.1 Bi-LSTM
The Bi-LSTM model uses two LSTMs (forward and backward) to learn the representation of each token in the sequence based on both its past and future context. At each decoding time step, the decoder uses a global attention mechanism to read those representations.
In more detail, we use two layers of Bi-LSTM cells with 256 hidden units as the encoder, and two layers of LSTM cells with 512 hidden units as the decoder. In addition, we use the Adam optimizer. The number of epochs, minibatch size, and dropout rate are set to 100, 64, and 0.5, respectively.
4.2 ConvS2S
ConvS2S Gehring et al. (2017) uses a convolutional architecture instead of RNNs. Encoder and decoder share the same convolutional structure, which applies kernels striding from one side of the sequence to the other and uses gated linear units as non-linear activations over the convolution outputs.
Our ConvS2S model adopts a four-layer encoder and a three-layer decoder, both using kernels of width 3 and a hidden size of 256. We adopt early stopping and learning-rate annealing, and set the maximum number of epochs to 100.
4.3 Transformer
Vaswani et al. (2017) proposed the Transformer, which is based on an attention mechanism and relies on no convolutional or recurrent architecture. Both the encoder and the decoder are composed of a stack of identical layers. Each layer contains two parts: a multi-head self-attention module and a position-wise fully connected feed-forward network.
Our Transformer is four layers deep; each layer uses multi-head self-attention with $h$ heads, keys of dimension $d_k$, values of dimension $d_v$, and sub-layer outputs of dimension $d_{model}$. In addition, we use the Adam optimizer and a dropout rate of 0.3.
4.4 Ensemble Model
Through careful observation (detailed in Section 5.2), we find that each model has a speciality in solving problems. We therefore propose an ensemble model that selects the result according to the models' generation probability:

$$\hat{Y} = \operatorname*{arg\,max}_{Y \in \{Y_1, Y_2, Y_3\}} P(Y \mid X),$$

where $Y$ is a target sequence generated by one of the models and $X$ is the source sequence. The output of the model with the highest generation probability is selected as the final output.
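The selection step reduces to an argmax over the models' beam-search scores. A minimal sketch, with made-up probabilities and a hypothetical candidate structure:

```python
import math

def ensemble_select(candidates):
    """Given each model's best decoded template and its generation
    log-probability log P(Y|X), return the most confident template."""
    best = max(candidates, key=lambda c: c["logprob"])
    return best["template"]

# Hypothetical beam-search results from the three models.
outputs = [
    {"model": "Bi-LSTM",     "template": "n1+n3+n2-n4", "logprob": math.log(0.61)},
    {"model": "ConvS2S",     "template": "n1+n3+n2-n4", "logprob": math.log(0.74)},
    {"model": "Transformer", "template": "n1-n3+n2-n4", "logprob": math.log(0.52)},
]
print(ensemble_select(outputs))  # n1+n3+n2-n4
```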
5 Experiments
In this section, we conduct experiments on the Math23K dataset to examine the performance of different Seq2Seq models. Our main result is a significant improvement over the baseline methods. We further conduct a case study to analyze why different Seq2Seq models solve different kinds of MWPs.
Dataset: Math23K (https://ai.tencent.com/ailab/Deep_Neural_Solver_for_Math_Word_Problems.html), collected by Wang et al. (2017), contains 23,162 labeled MWPs. All of these problems are linear algebra questions with only one unknown variable.
Baselines: We compare our methods with two baselines: DNS and DNS-Hybrid. Both were proposed in Wang et al. (2017), with state-of-the-art performance on the Math23K dataset. DNS is a vanilla Seq2Seq model that adopts a GRU Chung et al. (2014) as encoder and an LSTM as decoder. DNS-Hybrid is a hybrid model that combines DNS with a retrieval-based solver to achieve better performance.
|Model|Acc w/o EN (%)|Acc w/ EN (%)|
|Example 1: Two biological groups have produced 690 (n1) butterfly specimens in 15 (n2) days. The first group produced 20 (n3) each day. How many did the second group produce each day?|
|Bi-LSTM: (correct); ConvS2S: (error); Transformer: (error)|
|Example 2: A plane, at a speed of 500 (n1) km/h, takes 3 (n2) hours to travel from city A to city B. The return trip only takes 2 (n3) hours. What is the average speed of the plane over this round trip?|
|Bi-LSTM: (error); ConvS2S: (error); Transformer: (correct)|
|Example 3: Stamp A is of 2 (n1) paise denomination, and stamp B is of 7 (n2) paise denomination. If we are asked to buy 10 (n3) of each, how much more does it cost to buy stamps A than to buy stamps B?|
|Bi-LSTM: (error); ConvS2S: (correct); Transformer: (error)|
In the experiments, we use the test set of Math23K as our test set, and randomly split 1,000 problems from the training set as a validation set. Evaluation results are summarized in Table 2. First, to examine the effectiveness of equation normalization, model performance with and without equation normalization is compared. Then the performance of DNS, DNS-Hybrid, Bi-LSTM, ConvS2S, Transformer, and the ensemble model is examined on the dataset.
Several observations can be made from the results. First, the equation normalization process significantly improves the performance of every model: the accuracies of the different models increase by 2.7% to 7.1% after equation normalization. Second, Bi-LSTM, ConvS2S, and Transformer achieve much higher performance than DNS, which shows that popular machine translation models are also effective for automatic MWP solving. Third, by combining the Seq2Seq models, our ensemble model gains an additional 1.7% in accuracy.
In addition, we conducted three extra experiments to disentangle the benefits of the different EN techniques. Table 3 gives the details of this ablation study for the three Seq2Seq models. Taking Bi-LSTM as an example, the accuracies with Rule 1 (SE), Rule 2 (OE), and bracket elimination (EB) are 63.1%, 63.7%, and 65.3%, respectively. Clearly, the performance of the Seq2Seq models benefits from the equation normalization techniques.
5.2 Case Study
Further, we conduct a case analysis of the capabilities of the different Seq2Seq models, with three examples given in Table 4. Our analysis is summarized as follows: 1) The Transformer occasionally generates mathematically incorrect templates, while Bi-LSTM and ConvS2S rarely do, as shown in Example 1. This is probably because the size of the training data is still not enough to train the multi-head self-attention structures. 2) As Example 2 shows, the Transformer is adept at solving problems that require complex inference, mainly because the different heads of a self-attention structure can model various types of relationships between number tokens. 3) The multi-layer convolutional block structure in ConvS2S can properly process the context information of number tokens; in Example 3, it is the only model that captures the relationship between stamp A and stamp B.
6 Conclusion
In this paper, we first propose an equation normalization method that normalizes duplicated equation templates to a unique expression tree. We test different Seq2Seq models on MWP solving and propose an ensemble model to achieve higher performance. Experimental results demonstrate that the proposed equation normalization method and the ensemble model significantly improve on the state-of-the-art methods.
This work is supported in part by the National Natural Science Foundation of China under grant No. 61602087, and the Fundamental Research Funds for the Central Universities under grant No. ZYGX2016J080.
- Amnueypornsakul and Bhat (2014) Bussaba Amnueypornsakul and Suma Bhat. 2014. Machine-guided solution to mathematical word problems. In Proceedings of the 28th Pacific Asia Conference on Language, Information and Computation, PACLIC 28, Cape Panwa Hotel, Phuket, Thailand, December 12-14, 2014, pages 111–119.
- Bobrow (1964) Daniel G Bobrow. 1964. Natural language input for a computer problem solving system.
- Chung et al. (2014) Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
- Feigenbaum et al. (1963) Edward A Feigenbaum, Julian Feldman, et al. 1963. Computers and thought. New York.
- Gehring et al. (2017) Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. 2017. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122.
- Huang et al. (2017) Danqing Huang, Shuming Shi, Chin-Yew Lin, and Jian Yin. 2017. Learning fine-grained expressions to solve math word problems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 805–814. Association for Computational Linguistics.
- Koncel-Kedziorski et al. (2015) Rik Koncel-Kedziorski, Hannaneh Hajishirzi, Ashish Sabharwal, Oren Etzioni, and Siena Dumas Ang. 2015. Parsing algebraic word problems into equations. TACL, 3:585–597.
- Kushman et al. (2014) Nate Kushman, Yoav Artzi, Luke Zettlemoyer, and Regina Barzilay. 2014. Learning to automatically solve algebra word problems. Association for Computational Linguistics.
- Ling et al. (2017) Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. 2017. Program induction by rationale generation: Learning to solve and explain algebraic word problems. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 158–167. Association for Computational Linguistics.
- Mitra and Baral (2016) Arindam Mitra and Chitta Baral. 2016. Learning to use formulas to solve simple arithmetic problems. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers.
- Roy and Roth (2015) Subhro Roy and Dan Roth. 2015. Solving general arithmetic word problems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 1743–1752.
- Roy and Roth (2018) Subhro Roy and Dan Roth. 2018. Mapping to declarative knowledge for word problem solving. TACL, 6:159–172.
- Shi et al. (2015) Shuming Shi, Yuehui Wang, Chin-Yew Lin, Xiaojiang Liu, and Yong Rui. 2015. Automatically solving number word problems by semantic parsing and reasoning. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 1132–1142.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010.
- Wang et al. (2018) Lei Wang, Dongxiang Zhang, Lianli Gao, Jingkuan Song, Long Guo, and Heng Tao Shen. 2018. MathDQN: Solving arithmetic word problems via deep reinforcement learning. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence. AAAI Press.
- Wang et al. (2017) Yan Wang, Xiaojiang Liu, and Shuming Shi. 2017. Deep neural solver for math word problems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 845–854. Association for Computational Linguistics.
- Wu et al. (2016) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
- Zhang et al. (2018) Dongxiang Zhang, Lei Wang, Nuo Xu, Bing Tian Dai, and Heng Tao Shen. 2018. The gap of semantic parsing: A survey on automatic math word problem solvers. arXiv preprint arXiv:1808.07290.
- Zhou et al. (2015) Lipu Zhou, Shuaixiang Dai, and Liwei Chen. 2015. Learn to solve algebra word problems using quadratic programming. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 817–822.