Learning by Fixing: Solving Math Word Problems with Weak Supervision

12/19/2020 ∙ by Yining Hong, et al.

Previous neural solvers of math word problems (MWPs) are learned with full supervision and fail to generate diverse solutions. In this paper, we address this issue by introducing a weakly-supervised paradigm for learning MWPs. Our method only requires the annotations of the final answers and can generate various solutions for a single problem. To boost weakly-supervised learning, we propose a novel learning-by-fixing (LBF) framework, which corrects the misperceptions of the neural network via symbolic reasoning. Specifically, for an incorrect solution tree generated by the neural network, the fixing mechanism propagates the error from the root node to the leaf nodes and infers the most probable fix that can be executed to get the desired answer. To generate more diverse solutions, tree regularization is applied to guide the efficient shrinkage and exploration of the solution space, and a memory buffer is designed to track and save the discovered various fixes for each problem. Experimental results on the Math23K dataset show the proposed LBF framework significantly outperforms reinforcement learning baselines in weakly-supervised learning. Furthermore, it achieves comparable top-1 and much better top-3/5 answer accuracies than fully-supervised methods, demonstrating its strength in producing diverse solutions.





Figure 1: Exemplar MWP with multiple solutions.

Introduction

Solving math word problems (MWPs) poses unique challenges for understanding natural-language problems and performing arithmetic reasoning over quantities with commonsense knowledge. As shown in Figure 1, a typical MWP consists of a short narrative describing a situation in the world and asking a question about an unknown quantity. To solve the MWP in Figure 1, a machine needs to extract key quantities from the text, such as “100 kilometers” and “2 hours”, and understand the relationships between them. General mathematical knowledge like “distance = velocity × time” is then used to calculate the solution.

Researchers have recently focused on solving MWPs using neural-symbolic models Ling et al. (2017); Wang et al. (2017); Huang et al. (2018); Wang et al. (2018); Xie and Sun (2019). These models usually consist of a neural perception module (i.e., Seq2Seq or Seq2Tree) that maps the problem text into a solution expression or tree, and a symbolic module which executes the expression and generates the final answer. Training these models requires the full supervision of the solution expressions.

However, these fully-supervised approaches have three drawbacks. First, current MWP datasets only provide one solution for each problem, while there naturally exist multiple solutions that give different paths of solving the same problem. For instance, the problem in Figure 1 can be solved by first calculating the speed and then multiplying it by the total time; alternatively, it can be solved by summing the distances of the first and second parts of the journey. Models trained with full supervision on current datasets are forced to fit the given solution and cannot generate diverse solutions. Second, annotating the expressions for MWPs is time-consuming, whereas a large number of MWPs with their final answers can be mined effortlessly from the internet (e.g., online forums). How to efficiently utilize these partially-labeled data without the supervision of expressions remains an open problem. Third, current supervised learning approaches suffer from a train-test discrepancy: fully-supervised methods optimize expression accuracy during training, but the model is evaluated by answer accuracy on the test set, causing a natural performance gap.

To address these issues, we propose to solve the MWPs with weak supervision, where only the problem texts and the final answers are required. By directly optimizing the answer accuracy rather than the expression accuracy, learning with weak supervision naturally addresses the train-test discrepancy. Our model consists of a tree-structured neural model similar to Xie and Sun (2019) to generate the solution tree and a symbolic execution module to calculate the answer. However, the symbolic execution module for arithmetic expressions is non-differentiable with respect to the answer accuracy, making it infeasible to use back-propagation to compute gradients. A straightforward approach is to employ policy gradient methods like REINFORCE Williams (1992) to train the neural model. The policy gradient methods explore the solution space and update the policy based on generated solutions that happen to hit the correct answer. Since the solution space is large and incorrect solutions are abandoned with zero reward, these methods usually converge slowly or fail to converge.

To improve the efficiency of weakly-supervised learning, we propose a novel fixing mechanism to learn from incorrect predictions, which is inspired by the human ability to learn from failures via abductive reasoning Magnani (2009); Zhou (2019a). The fixing mechanism propagates the error from the root node to the leaf nodes in the solution tree and finds the most probable fix that can generate the desired answer. The fixed solution tree is further used as a pseudo label to train the neural model. Figure 2 shows how the fixing mechanism corrects the wrong solution tree by tracing the error in a top-down manner.

Furthermore, we design two practical techniques to traverse the solution space and discover possible solutions efficiently. First, we observe a positive correlation between the number of quantities in the text and the size of the solution tree (the number of leaf nodes in the tree), and propose a tree regularization technique based on this observation to limit the range of possible tree sizes and shrink the solution space. Second, we adopt a memory buffer to track and save the discovered fixes for each problem with the fixing mechanism. All memory buffer solutions are used as pseudo labels to train the model, encouraging the model to generate more diverse solutions for a single problem.

In summary, by combining the fixing mechanism and the above two techniques, the proposed learning-by-fixing (LBF) method contains an exploring stage and a learning stage in each iteration, as shown in Figure 2. We utilize the fixing mechanism and tree regularization to correct wrong answers in the exploring stage and generate fixed expressions as pseudo labels. In the learning stage, we train the neural model using these pseudo labels.

We conduct comprehensive experiments on the Math23K dataset Wang et al. (2017). The proposed LBF method significantly outperforms the reinforcement learning baselines in weakly-supervised learning and achieves comparable performance with several fully-supervised methods. Furthermore, our proposed method achieves significantly better answer accuracies of all the top-3/5 answers than fully-supervised methods, illustrating its advantage in generating diverse solutions. The ablative experiments also demonstrate the efficacy of the designed algorithms, including the fixing mechanism, tree regularization, and memory buffer.

Related Work

Math Word Problems

Recently, various question-answering tasks that require human-like reasoning abilities have emerged Qi et al. (2015); Tu et al. (2014); Zhang et al. (2019); Dua et al. (2019); Hong et al. (2019); Zhu et al. (2020); Zhang et al. (2020b); Li et al. (2020b); Yu et al. (2020). Among them, solving mathematical word problems (MWPs) is a fundamental and challenging task.

Previous studies of MWPs range from traditional rule-based methods Fletcher (1985); Bakman (2007); Yu-hui et al. (2010), statistical learning methods Kushman et al. (2014); Zhou et al. (2015); Mitra and Baral (2016); Roy and Roth (2017); Huang et al. (2016), and semantic-parsing methods Shi et al. (2015); Koncel-Kedziorski et al. (2015); Huang et al. (2017) to recent deep learning methods Ling et al. (2017); Wang et al. (2017); Huang et al. (2018); Robaidek et al. (2018); Wang et al. (2018, 2019); Chiang and Chen (2019); Xie and Sun (2019); Zhang et al. (2020a).

In particular, Deep Neural Solver (DNS) Wang et al. (2017) is a pioneering work that designs a Seq2Seq model to solve MWPs and achieves promising results. Xie and Sun (2019) propose a tree-structured neural solver to generate the solution tree in a goal-driven manner. All these neural solvers learn the model with full supervision, where the ground-truth intermediate representations (e.g., expressions, programs) are given during training. To learn the solver with less supervision, Koncel-Kedziorski et al. (2015) use a discriminative model to solve MWPs in a weakly-supervised way. They utilize separate modules to extract features, construct expression trees, and score the likelihood, which is different from the current end-to-end neural solvers. Upadhyay et al. (2016), Zhou et al. (2015), and Kushman et al. (2014) use mixed supervision, where one dataset has only annotated equations, and the other has only final answers. However, for the set with final answers, they also depend on pre-defined equation templates. Chen et al. (2020) apply a neural-symbolic reader on MathQA Amini et al. (2019), which is a large-scale dataset with fully-specified operational programs. They have access to the ground-truth programs for a small fraction of training samples at the first iterations of training.

Unlike these methods, the proposed LBF method requires only the supervision of the final answer and generates diverse solutions by keeping a memory buffer. Notably, it addresses the sparse reward problem in policy gradient methods using a fixing mechanism that propagates error down a solution tree and finds the most probable fix.

Neural-Symbolic Learning for NLP

Neural-symbolic learning has been applied to solve NLP tasks with weak supervision, such as semantic parsing and program synthesis Liang et al. (2016a); Guu et al. (2017); Liang et al. (2018); Agarwal et al. (2019); Li et al. (2020b). Similar to MWP, they generate intermediate symbolic representations with a neural network and execute the intermediate representation with a symbolic reasoning module to get the final result. Typical approaches for such neural-symbolic models use policy gradient methods like REINFORCE since the symbolic execution module is non-differentiable. For example, Neural Symbolic Machines Liang et al. (2016b) combines REINFORCE with a maximum-likelihood training process to find good programs. Guu et al. (2017) augment reinforcement learning with the maximum marginal likelihood so that probability is distributed evenly across consistent programs. Memory Augmented Policy Optimization (MAPO) Liang et al. (2018) formulates its learning objective as an expectation over a memory buffer of high-reward samples and a separate expectation outside the buffer, which helps accelerate and stabilize policy gradient training. Meta Reward Learning Agarwal et al. (2019) uses an auxiliary reward function to provide feedback beyond a binary success or failure. Since these methods can only learn from sparse successful samples, they suffer from cold start and inefficient exploration of large search spaces. Recently, Dai and Zhou (2017), Dai et al. (2019), and Zhou (2019b) introduce abductive learning, which states that human misperceptions can be corrected via abductive reasoning. In this paper, we follow the abductive learning method Li et al. (2020a) and propose a novel fixing mechanism to learn from negative samples, significantly accelerating and stabilizing the weakly-supervised learning process. We further design the tree regularization and memory buffer techniques to efficiently shrink and explore the solution space.

Weakly-Supervised MWPs

Figure 2: Overview of our proposed learning-by-fixing (LBF) method. It shows the process for learning the example in Figure 1. LBF works by iteratively exploring the solution space and learning the MWP solver. Exploring: the problem first goes through the GTS module and produces a tentative solution using tree regularization. Then the fixing mechanism diagnoses this solution by propagating the correct answer in a top-down manner. The fixed solution is then added to the memory buffer. Learning: all solutions in the memory buffer are used as pseudo labels to train the GTS module using a cross-entropy loss function.

In this section, we define the weakly-supervised math word problems and describe the goal-driven tree model originating from Xie and Sun (2019). Then we introduce the proposed learning-by-fixing method, as also shown in Figure 2.

Problem Definition

A math word problem is represented by an input problem text $P$. The machine learning model with parameters $\theta$ is required to translate $P$ into an intermediate expression $T$, which is executed to compute the final answer $y$. In fully-supervised learning, we learn from the ground-truth expression $T$ and the final answer $y$. The learning objective is to maximize the data likelihood $p(T, y \mid P; \theta) = p(y \mid T)\, p(T \mid P; \theta)$, where computing $y$ given $T$ is a deterministic process. In contrast, in the weakly-supervised setting, only $P$ and $y$ are observed, while $T$ is hidden. In other words, the model is required to generate an unknown expression from the problem text. The expression is then executed to get the final answer.
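The two settings can be contrasted in one line each. The block below is a sketch rather than a formula quoted from the paper: it assumes the notation $P$ (problem text), $T$ (expression), $y$ (answer), and $\theta$ (parameters), and it writes the deterministic execution step as an indicator, where $\mathrm{exec}(T)$ is a hypothetical name for the symbolic executor.

```latex
\begin{aligned}
\text{fully supervised:}\quad
  & \max_{\theta}\; \log p(T, y \mid P; \theta)
    = \log p(y \mid T) + \log p(T \mid P; \theta), \\
\text{weakly supervised:}\quad
  & \max_{\theta}\; \log p(y \mid P; \theta)
    = \log \sum_{T} p(y \mid T)\, p(T \mid P; \theta), \\
\text{with}\quad
  & p(y \mid T) = \mathbb{1}\big[\mathrm{exec}(T) = y\big].
\end{aligned}
```

Because the sum ranges over all expressions and the indicator is non-differentiable, the weakly-supervised objective cannot be optimized by plain back-propagation, which is what motivates the search-based approach that follows.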

Goal-driven Tree-Structured Model

A problem text consists of words and numeric values. The model takes in problem text $P$ and generates a solution tree $T$. Let $n_P$ denote the ordered list of numeric values in $P$ according to their order in the problem text. Generally, $T$ may contain constants $V_{con}$, mathematical operators $V_{op}$, and numeric values from the problem text $n_P$. Therefore, the target vocabulary of $P$ is denoted as $V^P = V_{op} \cup V_{con} \cup n_P$ and it varies between problems due to different $n_P$.

To generate the solution tree, we adopt the goal-driven tree-structured neural model (GTS) Xie and Sun (2019), which first encodes the problem text into its goal and then recursively decomposes it into sub-goals in a top-down manner.

Problem Encoding. Each word of the problem text is encoded into a contextual representation. Specifically, for a problem $P = (w_1, \dots, w_N)$, each word $w_i$ is first converted to a word embedding $\mathbf{w}_i$. Then the sequence of embeddings is fed into a bi-directional GRU Cho et al. (2014) to produce a contextual word representation:

$$h_i = \overrightarrow{h_i} + \overleftarrow{h_i} \qquad (1)$$

where $\overrightarrow{h_i}$ and $\overleftarrow{h_i}$ are the hidden states of the forward and backward GRUs at position $i$, respectively.

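Solution Tree Generation. The tree generation process is designed as a preorder tree traversal (root-left-right). The root node of the solution tree is initialized with a goal vector

$$q_0 = \overrightarrow{h_N} + \overleftarrow{h_1} \qquad (2)$$

where $N$ is the number of words in the problem. For a node with goal q, we first derive a context vector c by an attention mechanism to summarize relevant information from the problem:

$$a_i = \frac{\exp\big(\mathrm{score}(q, h_i)\big)}{\sum_{j} \exp\big(\mathrm{score}(q, h_j)\big)}, \quad \mathrm{score}(q, h_i) = v_a^{\top} \tanh\big(W_a [q, h_i]\big), \quad c = \sum_{i} a_i h_i \qquad (3)$$

where $v_a$ and $W_a$ are trainable parameters. Then the goal q and the context c are used to predict the token of this node from the target vocabulary $V^P$. The probability of token $t$ is defined as:

$$s(t \mid q, c, P) = w_n^{\top} \tanh\big(W_s [q, c, e(t \mid P)]\big), \quad p(t \mid q, c, P) = \frac{\exp\big(s(t \mid q, c, P)\big)}{\sum_{t' \in V^P} \exp\big(s(t' \mid q, c, P)\big)} \qquad (4)$$

where $e(t \mid P)$ is the embedding of token $t$:

$$e(t \mid P) = \begin{cases} M_{op}(t) & \text{if } t \text{ is an operator} \\ M_{con}(t) & \text{if } t \text{ is a constant} \\ h_{loc(t, P)} & \text{if } t \text{ is a number in } P \end{cases} \qquad (5)$$

where $M_{op}$ and $M_{con}$ are two trainable embeddings for operators and constants, respectively. For a number token, its embedding is the corresponding hidden state $h_{loc(t, P)}$ from the encoder, where $loc(t, P)$ is the index of $t$ in the problem $P$. The predicted token is:

$$\hat{t} = \arg\max_{t \in V^P} p(t \mid q, c, P) \qquad (6)$$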
If the predicted token is a number token or constant, the node is terminated and its goal is realized by the predicted token; otherwise, the predicted token is an operator and the current goal is decomposed into left and right sub-goals combined by the operator. Please refer to the supplementary material for more details about the goal decomposition process.

Answer Calculation. The generated solution tree is transformed into a reasoning tree by creating auxiliary non-terminal nodes in place of the operator nodes to store the intermediate results; the original operator nodes are attached as child nodes to the corresponding auxiliary nodes. The final answer is then calculated by executing the reasoning tree in a bottom-up manner and reading off the value of the root node.
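The bottom-up execution step can be illustrated with a few lines of Python. This is a minimal sketch, not the authors' implementation: it assumes the solution is given as a prefix (preorder) token list over the four basic binary operators, and the function names `build_tree` and `execute` as well as the example quantities are hypothetical. Exact rational arithmetic is used so that answer comparison does not depend on floating-point rounding.

```python
from fractions import Fraction

# Binary operators supported by this sketch (an assumption, not the paper's full set).
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}

def build_tree(tokens):
    """Parse a prefix token list into nested (op, left, right) tuples with Fraction leaves."""
    it = iter(tokens)

    def parse():
        tok = next(it)
        if tok in OPS:
            left = parse()
            right = parse()
            return (tok, left, right)
        return Fraction(tok)

    tree = parse()
    if next(it, None) is not None:
        raise ValueError("trailing tokens after a complete expression")
    return tree

def execute(tree):
    """Evaluate the tree bottom-up: children first, then the operator at the node."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPS[op](execute(left), execute(right))
    return tree

# Hypothetical example: (100 / 2) * 3 written in prefix order.
answer = execute(build_tree(["*", "/", "100", "2", "3"]))
```

Keeping the intermediate value of every internal node, as the reasoning tree does, is what later allows an expected value to be pushed back down from the root.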


Fixing Mechanism

Drawing inspiration from humans’ ability to correct and learn from failures, we propose a fixing mechanism to correct the wrong solution trees via abductive reasoning following Li et al. (2020a) and use the fixed solution trees as pseudo labels for training. Specifically, we find the most probable fix for the wrong prediction by back-tracking the reasoning tree and propagating the error from the root node into the leaf nodes in a top-down manner.

The key ingredient in the fixing mechanism is the 1-step fix (1-FIX) algorithm, which assumes that only one symbol in the reasoning tree can be substituted. As shown by the 1-FIX function in Algorithm 1, the 1-step fix starts from the root node of the reasoning tree and gradually searches down to find a fix that makes the final output equal to the ground-truth answer. The search process is implemented with a priority queue, where each element is defined as a fix-tuple $(A, \alpha_A, p)$:

  • $A$ is the current visiting node.

  • $\alpha_A$ is the expected value on this node, which means that if the value of $A$ is changed to $\alpha_A$, the reasoning tree will execute to the ground-truth answer $y$.

  • $p$ is the visiting priority, which reflects the probability of changing the value of $A$.

In 1-FIX, error propagation through the solution tree is achieved by a solve function, which aims at computing the expected value of a child node from its parent’s expected value. Supposing $B$ is $A$’s child node and $\alpha_A$ is the expected value of $A$, the solve function works as follows:

  • If $B$ is $A$’s left or right child, we directly solve the equation $\alpha_B \oplus child_R(A) = \alpha_A$ or $child_L(A) \oplus \alpha_B = \alpha_A$ to get $B$’s expected value $\alpha_B$, where $\oplus$ denotes the operator.

  • If $B$ is an operator node, we try to replace $B$ with all other operators and check whether the new expression can generate the correct answer. That is, $child_L(A) \oplus child_R(A) = \alpha_A$, where $\oplus$ is now an operator. If there is no $\oplus$ satisfying this equation, the solve function returns none.

Please refer to the supplementary material for the definition of the visiting priority as well as the illustrative example of the 1-FIX process.

To search the neighbors of the reasoning tree within multi-step distance, we extend the 1-step fix to a multi-step fix (m-FIX) by incorporating a RandomWalk function. As shown in Algorithm 1, if we find a fix by 1-FIX, we return this fix; otherwise, we randomly change one leaf node in the reasoning tree to another symbol within the same set (e.g., an operator is replaced by another operator) based on the probability in Equation 4. This process is repeated for a fixed number of iterations or until it finds a fix for the solution.

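Input: reasoning tree T, ground-truth answer y
T(0) = T
for i = 0 to m do
    T* = 1-FIX(T(i), y)
    if T* is not ∅ then
        return T*
    else
        T(i+1) = RandomWalk(T(i))
return ∅

function 1-FIX(T, y)
    queue = PriorityQueue(), S = the root node of T
    queue.push(S, y, 1)
    while (A, α_A, p) = queue.pop() do
        if A is a leaf node then
            T* = T with A replaced by α_A
            return T*
        for B in child(A) do
            α_B = solve(B, A, α_A)
            if not (B is a leaf node and α_B is not a valid symbol) then
                queue.push(B, α_B, p(B → α_B))
    return ∅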
Algorithm 1 Fixing Mechanism
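The 1-step fix can be sketched in runnable form as follows. Several simplifications are assumptions of this sketch rather than details taken from the paper: trees are nested `(op, left, right)` tuples instead of reasoning trees with auxiliary nodes, only the four basic operators are handled, the visiting priority is simply the node depth (the paper defines its priority in the supplementary material), and the RandomWalk extension is omitted. All function names are hypothetical.

```python
import heapq
from fractions import Fraction
from itertools import count

OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}

def evaluate(node):
    """Execute a tree bottom-up; internal nodes are (op, left, right) tuples."""
    if isinstance(node, tuple):
        op, left, right = node
        return OPS[op](evaluate(left), evaluate(right))
    return node

def solve_operand(op, target, sibling, is_left):
    """Value x with (x op sibling) == target if is_left, else (sibling op x) == target."""
    if op == "+":
        return target - sibling
    if op == "-":
        return target + sibling if is_left else sibling - target
    if op == "*":
        return target / sibling if sibling != 0 else None
    if is_left:                          # x / sibling == target
        return target * sibling if sibling != 0 else None
    if target == 0 or sibling == 0:      # sibling / x == target has no usable solution
        return None
    return sibling / target

def replace(tree, path, new):
    """Return a copy of tree where the element reached by path (tuple indices) is new."""
    if not path:
        return new
    parts = list(tree)
    parts[path[0]] = replace(tree[path[0]], path[1:], new)
    return tuple(parts)

def one_fix(tree, answer, symbols):
    """Search top-down for a single substitution that makes tree evaluate to answer."""
    tie = count()                        # tie-breaker so the heap never compares trees
    queue = [(0, next(tie), (), tree, answer)]
    while queue:
        depth, _, path, node, expected = heapq.heappop(queue)
        if not isinstance(node, tuple):  # leaf: fixable only if expected is a known symbol
            if expected in symbols:
                return replace(tree, path, expected)
            continue
        op, left, right = node
        try:
            lv, rv = evaluate(left), evaluate(right)
        except ZeroDivisionError:
            continue
        for cand, fn in OPS.items():     # try substituting the operator
            if cand == op:
                continue
            try:
                if fn(lv, rv) == expected:
                    return replace(tree, path + (0,), cand)
            except ZeroDivisionError:
                pass
        for idx, child, sibling, is_left in ((1, left, rv, True), (2, right, lv, False)):
            want = solve_operand(op, expected, sibling, is_left)
            if want is not None:         # propagate the expected value to the child
                heapq.heappush(queue, (depth + 1, next(tie), path + (idx,), child, want))
    return None
```

For instance, with quantities 100, 2, and 3 and a desired answer of 150, the wrong tree `("+", ("/", 100, 2), 3)` is repaired by swapping the root operator to `"*"`, while `("*", ("/", 100, 3), 3)` is repaired by propagating the expected value 50 to the left subtree and then 2 to its right leaf.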

Solution Space Exploration

Tree Regularization. While Li et al. (2020a) assumes the length of the intermediate representation is given, the expression length is unknown in weakly-supervised learning. Thus, the original solution space is infinite since the predicted token decides whether to continue the generation or stop. Therefore, it is critical to shrink the solution space, i.e., control the size of the generated solution trees. If the size of the generated solution tree varies a lot from the target size, it would be challenging for the solution or its fix to hit the correct answer. Although the target size is unknown, we observe a positive correlation between the target size and the number of quantities in text. Regarding this observation as a tree size prior, we design a tree regularization algorithm to generate a solution tree with a target size and regularize the size in an empirical range. Denote the size $l$ of a solution tree as the number of leaf nodes including quantities, constants, and operators. The prior range of $l$ given the length $n$ of the numeric value list is defined as:

$$l \in [\,a_{\min} n + b_{\min},\; a_{\max} n + b_{\max}\,] \qquad (7)$$

where $a_{\min}, b_{\min}, a_{\max}, b_{\max}$ are the hyperparameters. The effect of these hyperparameters will be discussed in Table 2.

We further propose a tree regularization algorithm to decode a solution tree with a given size. To generate a tree of a given size $l$, we design two rules to produce a prefix-order expression during the preorder tree decoding:

  1. The number of operators cannot be greater than $\lfloor l/2 \rfloor$.

  2. Except the $l$-th position, the number of numeric values (quantities and constants) cannot be greater than the number of operators.

These two rules are inspired by the syntax of prefix notation (a.k.a. normal Polish notation) for mathematical expressions. The rules shrink the target vocabulary in Equation 6 so that the tree generation can be stopped when it reaches the target size. Figure 3 shows illustrative examples of the tree regularization algorithm.

With tree regularization, we can search the possible fixes within a given range of tree size for each problem.
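The two decoding rules can be checked with a short sketch. It assumes binary operators and an odd target size (a full binary tree with k operators has 2k + 1 tokens), reads the operator bound as ⌊size / 2⌋, and abstracts tokens to the two types `"op"` and `"num"`; the function names are hypothetical.

```python
def allowed_types(position, size, n_ops, n_nums):
    """Token types permitted at a 1-indexed position, given counts decoded so far."""
    allowed = set()
    # Rule 1: the number of operators may not exceed floor(size / 2).
    if n_ops + 1 <= size // 2:
        allowed.add("op")
    # Rule 2: except at the last position, numbers may not outnumber operators.
    if position == size or n_nums + 1 <= n_ops:
        allowed.add("num")
    return allowed

def enumerate_skeletons(size):
    """All op/num type sequences of exactly `size` tokens that the two rules permit."""
    if size % 2 == 0:
        raise ValueError("this sketch assumes an odd target size")
    results = []

    def extend(prefix, n_ops, n_nums):
        if len(prefix) == size:
            results.append(prefix)
            return
        for kind in sorted(allowed_types(len(prefix) + 1, size, n_ops, n_nums)):
            extend(prefix + [kind], n_ops + (kind == "op"), n_nums + (kind == "num"))

    extend([], 0, 0)
    return results
```

Under these assumptions the rules admit exactly the well-formed prefix skeletons of the requested size: one for size 3, two for size 5, and five for size 7, which matches the number of full binary trees with one, two, and three internal nodes.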

Figure 3: Tree regularization for the problem in Figure 1 given different target sizes. The three columns are the generated tokens, the effective rules, and the target vocabularies shrunk by the rules, respectively.

Memory Buffer. We adopt a memory buffer to track and save the discovered fixes for each problem. The memory buffer enables us to seek multiple solutions for a single problem and use all of them as pseudo labels for training, which encourages diverse solutions. Formally, given a problem $P$ and its buffer $\mathcal{B}$, the learning objective is to minimize the negative log-likelihood of all fixed expressions in the buffer:

$$J(P, \mathcal{B}) = -\sum_{T \in \mathcal{B}} \log p(T \mid P; \theta) \qquad (8)$$
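As a small numeric illustration of this objective, the sketch below sums negative log-probabilities over every token of every buffered tree. It assumes the model factorizes the probability of a tree into per-token probabilities, as a preorder decoder does; the function name and the input layout are hypothetical.

```python
import math

def buffer_nll(token_probs_per_solution):
    """Negative log-likelihood of all buffered solutions.

    token_probs_per_solution: one list per buffered tree, holding the model's
    probability for each token of that tree in decoding order.
    """
    return -sum(
        math.log(p)
        for solution in token_probs_per_solution
        for p in solution
    )

# Two buffered solutions: one with two tokens, one with a single token.
loss = buffer_nll([[0.5, 0.5], [0.25]])
```

Every solution in the buffer contributes to the loss, so probability mass is pushed toward all discovered fixes instead of a single annotated expression.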


Learning-by-Fixing Framework

The complete learning-by-fixing method is described in Algorithm 2. In the exploring stage, we use the fixing mechanism and tree regularization to discover possible fixes for the wrong trees generated by the neural network, and put them into a buffer. In the learning stage, we train the model with all the solutions in the memory buffer by minimizing the loss function in Equation 8.

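Input: training set D = {(P_i, y_i)}
memory buffer B = {B_i}, the GTS model θ
for (P_i, y_i, B_i) in (D, B) do
    Exploring
    T_i = GTS(P_i; θ)
    T*_i = m-FIX(T_i, y_i)
    if T*_i is not ∅ and T*_i is not in B_i then
        B_i ← B_i ∪ {T*_i}
    Learning
    update θ to minimize J(P_i, B_i) (Equation 8)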
Algorithm 2 Learning-by-Fixing
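The control flow of the exploring and learning stages can be sketched with the model-specific parts passed in as callables. Everything named here is hypothetical: `propose` stands in for decoding a tree with the neural model, `fix` for the multi-step fixing mechanism, and `train_step` for one gradient update on the buffered pseudo labels. Skipping the update when the buffer is empty is a choice made for this sketch.

```python
def learning_by_fixing(dataset, propose, fix, train_step, epochs=1):
    """Run explore-then-learn over (problem, answer) pairs; returns the per-problem buffers."""
    buffers = [[] for _ in dataset]
    for _ in range(epochs):
        for (problem, answer), buffer in zip(dataset, buffers):
            # Exploring: decode a tentative tree and try to repair it.
            tree = propose(problem)
            fixed = fix(tree, answer)
            if fixed is not None and fixed not in buffer:
                buffer.append(fixed)
            # Learning: use every buffered solution as a pseudo label.
            if buffer:
                train_step(problem, buffer)
    return buffers
```

A usage example with toy stand-ins: passing `propose=lambda p: ("+", 1, 1)` and `fix=lambda t, y: ("+", 1, 2)` fills each buffer with a single fixed tree, no matter how many epochs are run, because duplicates are not stored twice.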

Experimental Results

Experimental Setup

Dataset. We evaluate our proposed method on the Math23K dataset Wang et al. (2017). It contains 23,161 math word problems annotated with solution expressions and answers. For the weakly-supervised setting, we only use the problems and final answers and discard the expressions. We do cross-validation following the setting of Xie and Sun (2019).

Evaluation Metric. We evaluate the model performance by answer accuracy, where the generated solution is considered correct if it executes to the ground-truth answer. Specifically, we report answer accuracies of all the top-$k$ predictions using beam search, which evaluates the model’s ability to generate multiple possible solutions.
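One plausible reading of this metric, sketched below, is the fraction of all top-k beam predictions (over all test problems) that execute to the ground-truth answer. This reading, the numeric tolerance, and the function name are assumptions of the sketch; beams with fewer than k candidates are counted as having wrong predictions in the missing slots.

```python
def acc_at_k(beam_answers, gold_answers, k, tol=1e-4):
    """Share of the top-k executed answers, pooled over problems, that match the gold answer."""
    if not gold_answers or k <= 0:
        return 0.0
    correct = 0
    for answers, gold in zip(beam_answers, gold_answers):
        # answers are the executed results of the beam, best candidate first
        correct += sum(abs(a - gold) <= tol for a in answers[:k])
    return correct / (k * len(gold_answers))
```

Under this reading, Acc@3 and Acc@5 can be lower than Acc@1, since lower-ranked candidates are less likely to be correct; this is consistent with the pattern of the reported numbers.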

Models. We conduct experiments by comparing our methods with variants of weakly-supervised learning methods. Specifically, we experiment with two inference models: Seq2Seq with a bidirectional Long Short-Term Memory network (BiLSTM) Wu et al. (2016) and GTS Xie and Sun (2019), and train with four learning strategies: REINFORCE, MAPO Liang et al. (2018), LBF, and LBF-w/o-M (without memory buffer). MAPO is a state-of-the-art method for semantic parsing that extends REINFORCE with augmented memory. Both models are also trained with the tree regularization algorithm. We also compare with the fully-supervised learning methods to demonstrate our superiority in generating diverse solutions. In the ablative studies, we analyze the effect of the proposed tree regularization and the length of search steps in the fixing mechanism.

Comparisons with State-of-the-art

Table 1 summarizes the answer accuracy of different weakly-supervised learning methods and the state-of-the-art fully-supervised approaches. The proposed learning-by-fixing framework significantly outperforms the policy gradient baselines like REINFORCE and MAPO, on both the Seq2seq and the GTS models. It demonstrates the strength of our proposed LBF method in weakly-supervised learning. The GTS-LBF-fully model is trained by initializing the memory buffer with all the ground-truth expressions. It demonstrates that by extending to the fully-supervised setting, our model maintains the top-1 accuracy while significantly improving solutions’ diversity. We believe that learning MWPs with weak supervision is a promising direction. It requires fewer annotations and allows us to build larger datasets with less cost.

| Model | Accuracy (%) |
| --- | --- |
| Retrieval Robaidek et al. (2018) | 47.2 |
| Classification Robaidek et al. (2018) | 57.9 |
| LSTM Robaidek et al. (2018) | 51.9 |
| CNN Robaidek et al. (2018) | 42.3 |
| DNS Wang et al. (2017) | 58.1 |
| Seq2seqET Wang et al. (2018) | 66.7 |
| Stack-Decoder Chiang and Chen (2019) | 65.8 |
| T-RNN Wang et al. (2019) | 66.9 |
| GTS Xie and Sun (2019) | 74.3 |
| Graph2Tree Zhang et al. (2020a) | 74.8 * |
| GTS-LBF-fully | 74.1 |
| Seq2seq-REINFORCE | 1.2 |
| Seq2seq-MAPO | 10.7 |
| Seq2seq-LBF-w/o-M | 44.7 |
| Seq2seq-LBF | 43.6 |
| GTS-MAPO | 20.8 |
| GTS-LBF-w/o-M | 58.3 |
| GTS-LBF | 59.4 |

* We run the code using the same setting as GTS three times and compute the average accuracy.

Table 1: Answer accuracy on the Math23K dataset. We compare variants of models with our LBF method.

Convergence Speed

Figure 4 shows the learning curves of different weakly-supervised learning methods for the GTS model. The proposed LBF method converges significantly faster and achieves higher accuracy than the other methods. Both REINFORCE and MAPO take a long time to start improving, which indicates that policy gradient methods suffer from cold start and need time to accumulate rewarding samples.

Figure 4: The learning curves of the GTS model using different weakly-supervised learning methods.

Diverse Solutions with Memory Buffer

To evaluate the ability to generate diverse solutions, we report the answer accuracies of all the top-1/3/5 solutions on the test set using beam search, denoted as Acc@1/3/5, as shown in Table 2. In the weakly-supervised scenario, GTS-LBF achieves slightly better Acc@1 accuracy and much better Acc@3/5 accuracy than GTS-LBF-w/o-M. In the fully-supervised scenario, GTS-LBF-fully achieves comparable Acc@1 accuracy and much better Acc@3/5 accuracy than the original GTS model. In particular, GTS-LBF-fully outperforms GTS by about 21 and 26 percentage points in Acc@3 and Acc@5, respectively (63.4 vs. 42.2 and 56.3 vs. 30.0). This reveals the efficacy of the memory buffer in encouraging diverse solutions in both weakly-supervised and fully-supervised learning.

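| Model | Tree Size | Acc@1 | Acc@3 | Acc@5 |
| --- | --- | --- | --- | --- |
| Fully Supervised | | | | |
| GTS | - | 74.3 | 42.2 | 30.0 |
| GTS-LBF-fully | - | 74.1 | 63.4 | 56.3 |
| Weakly Supervised | | | | |
| GTS-LBF-w/o-M | [1, ∞) | 0 | 0 | 0 |
| GTS-LBF-w/o-M | [2n-1, 2n+1] | 55.3 | 26.2 | 19.3 |
| GTS-LBF-w/o-M | [2n-1, 2n+3] | 58.3 | 27.7 | 20.3 |
| GTS-LBF-w/o-M | [2n-3, 2n+5] | 56.7 | 27.7 | 20.6 |
| GTS-LBF | [1, ∞) | 0 | 0 | 0 |
| GTS-LBF | [2n-1, 2n+1] | 56.7 | 45.3 | 39.1 |
| GTS-LBF | [2n-1, 2n+3] | 59.4 | 49.6 | 45.2 |
| GTS-LBF | [2n-3, 2n+5] | 57.6 | 49.3 | 45.2 |

Table 2: Answer accuracies (%) of all the top-1/3/5 solutions decoded using beam search, denoted as Acc@1/3/5.

Figure 5: Qualitative results on the Math23K dataset. We visualize the solution trees generated by our method.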

Qualitative Analysis

We visualize several examples of the top-5 predictions of GTS-LBF in Figure 5. In the first example, the first solution generated by our model is to sum up the prices of a table and a chair first, and then multiply it by the number of pairs of tables and chairs. Our model can also produce another reasonable solution (the fifth column) by deriving the prices of tables and chairs separately and then summing them up.

One caveat for the multiple solutions is that some solutions have different solution trees but are equivalent by switching the order of numeric values or subtrees, as shown in the first four solutions of the first problem in Figure 5. In particular, multiplication and addition are commutative, and our model learns and exploits this property to generate equivalent solutions with different tree structures.
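One simple way to detect such equivalent solutions is to normalize each tree before comparing. The sketch below is an illustration and not part of the paper's method: it assumes trees are nested `(op, left, right)` tuples, sorts the operands of the commutative operators `+` and `*` into a fixed order, and leaves associativity and other algebraic identities unhandled.

```python
def canonical(tree):
    """Return a normal form in which operands of + and * appear in a fixed order."""
    if not isinstance(tree, tuple):
        return tree
    op, left, right = tree
    left, right = canonical(left), canonical(right)
    # Only + and * are commutative; - and / must keep their operand order.
    if op in ("+", "*") and repr(right) < repr(left):
        left, right = right, left
    return (op, left, right)

def equivalent(tree_a, tree_b):
    """True when two trees differ only by swapping operands of commutative operators."""
    return canonical(tree_a) == canonical(tree_b)
```

For example, `("*", ("+", "a", "b"), "n")` and `("*", "n", ("+", "b", "a"))` map to the same normal form, while `("-", "a", "b")` and `("-", "b", "a")` do not.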

| | Right | Wrong | Spurious |
| --- | --- | --- | --- |
| Acc@1 | 58.6 | 40.6 | 0.56 |
| Acc@3 | 49.3 | 50.4 | 0.27 |
| Acc@5 | 44.9 | 54.8 | 0.32 |

Table 3: Human evaluation on the generated solutions (%).

The first solution to the fourth problem in Figure 5 is a typical error case of our model due to the wrong prediction of the problem goal. Another failure type is the spurious solution, which reaches the correct answer through an expression that is not meaningful, such as the second solution of the third problem in Figure 5. To test how frequently spurious solutions appear, we randomly select 500 examples from the test set and ask three human annotators to determine whether each generated expression is right, wrong, or spurious. Table 3 provides the human evaluation results, which show that spurious solutions are rare in our model.

Ablative Analyses

Tree Regularization. We test different choices of the hyperparameters defined by Equation 7 in tree regularization. As shown in Table 2, the model without tree regularization, i.e., tree size $l \in [1, \infty)$, fails to converge and gets nearly 0 accuracy. The best range for the solution tree size is $[2n-1, 2n+3]$, where $n$ is the number of quantities in the problem text. We provide an intuitive interpretation of this range: for a problem with $n$ quantities, (1) $n-1$ operators are needed to connect $n$ quantities, which leads to a lower bound of $2n-1$ on the tree size; (2) in certain cases, the constants or quantities are used more than once, leading to a rough upper bound of $2n+3$. Therefore, we use $[2n-1, 2n+3]$ as the default range in our implementations. Empirically, this range covers 88% of the lengths of the given ground-truth expressions in the Math23K dataset, providing an efficient prior for tree size.
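The size prior is easy to state as code. The sketch below enumerates the candidate tree sizes for a problem with a given number of quantities, using the 2n + offset form of the ranges listed in Table 2; the function name, the parameter names, and the default offsets (taken from the best-performing row of Table 2) are choices made for this illustration.

```python
def tree_size_range(num_quantities, lower_offset=-1, upper_offset=3):
    """Candidate tree sizes [2n + lower_offset, 2n + upper_offset], clipped below at 1."""
    low = max(1, 2 * num_quantities + lower_offset)
    high = 2 * num_quantities + upper_offset
    return list(range(low, high + 1))

# A problem with three quantities: the smallest tree uses each quantity once
# (3 numbers + 2 operators = 5 tokens), the largest allows two extra number/operator pairs.
sizes = tree_size_range(3)
```

With three quantities this yields the sizes 5 through 9; the lower end corresponds to connecting all quantities with n - 1 binary operators.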

Number of Search Steps

Table 4 shows the comparison of various step lengths in the m-FIX algorithm. In most cases, increasing the step length improves the chances of correcting wrong solutions, thus improving the performance.

| Models \ Steps | 1 | 10 | 50 (default) | 100 |
| --- | --- | --- | --- | --- |
| Seq2seq-LBF-w/o-M | 41.9 | 43.4 | 44.7 | 47.8 |
| Seq2seq-LBF | 43.9 | 45.7 | 43.6 | 44.6 |
| GTS-LBF-w/o-M | 51.2 | 54.6 | 58.3 | 57.8 |
| GTS-LBF | 52.5 | 55.8 | 59.4 | 59.6 |

Table 4: Accuracy (%) using various search steps.


Conclusion

In this work, we propose a weakly-supervised paradigm for learning MWPs and a novel learning-by-fixing framework to boost the learning. Our method endows the MWP learner with the capability of learning from wrong solutions, thus significantly improving the answer accuracy and learning efficiency. One future direction of the proposed model is to prevent generating equivalent or spurious solutions during training, possibly by making the generated solution trees more interpretable with semantic constraints.

Ethical Impact

The presented work should be categorized as research in the field of weakly-supervised learning and abductive reasoning. It can help teachers in school get various solutions of a math word problem. This work may also inspire new algorithmic, theoretical, and experimental investigation in neural-symbolic methods and NLP tasks.


Acknowledgements

The work reported herein is supported by ARO W911NF1810296, DARPA XAI N66001-17-2-4029, and ONR MURI N00014-16-1-2007.


  • R. Agarwal, C. Liang, D. Schuurmans, and M. Norouzi (2019) Learning to generalize from sparse and underspecified rewards. In ICML, Cited by: Neural-Symbolic Learning for NLP.
  • A. Amini, S. Gabriel, S. Lin, R. Koncel-Kedziorski, Y. Choi, and H. Hajishirzi (2019) MathQA: towards interpretable math word problem solving with operation-based formalisms. In NAACL-HLT, Cited by: Math Word Problems.
  • Y. Bakman (2007) Robust understanding of word problems with extraneous information. Cited by: Math Word Problems.
  • X. Chen, C. Liang, A. W. Yu, D. Zhou, D. Song, and Q. V. Le (2020) Neural symbolic reader: scalable integration of distributed and symbolic representations for reading comprehension. In ICLR, Cited by: Math Word Problems.
  • T. Chiang and Y. Chen (2019) Semantically-aligned equation generation for solving and reasoning math word problems. ArXiv abs/1811.00720. Cited by: Math Word Problems, Table 1.
  • K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using rnn encoder-decoder for statistical machine translation. EMNLP. Cited by: Goal-driven Tree-Structured Model.
  • W. Dai, Q. Xu, Y. Yu, and Z. Zhou (2019) Bridging machine learning and logical reasoning by abductive learning. In Advances in Neural Information Processing Systems, pp. 2811–2822. Cited by: Neural-Symbolic Learning for NLP.
  • W. Dai and Z. Zhou (2017) Combining logical abduction and statistical induction: discovering written primitives with human knowledge. In Thirty-First AAAI Conference on Artificial Intelligence, Cited by: Neural-Symbolic Learning for NLP.
  • D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner (2019) DROP: a reading comprehension benchmark requiring discrete reasoning over paragraphs. In NAACL-HLT, Cited by: Math Word Problems.
  • C. R. Fletcher (1985) Understanding and solving arithmetic word problems: a computer simulation. Behavior Research Methods, Instruments, & Computers 17, pp. 565–571. Cited by: Math Word Problems.
  • K. Guu, P. Pasupat, E. Z. Liu, and P. Liang (2017) From language to programs: bridging reinforcement learning and maximum marginal likelihood. In ACL, Cited by: Neural-Symbolic Learning for NLP.
  • Y. Hong, J. Wang, Y. Jia, W. Zhang, and X. Wang (2019) Academic reader: an interactive question answering system on academic literatures. Thirty-Third AAAI Conference on Artificial Intelligence. Cited by: Math Word Problems.
  • D. Huang, J. Liu, C. Lin, and J. Yin (2018) Neural math word problem solver with reinforcement learning. In COLING, Cited by: Introduction, Math Word Problems.
  • D. Huang, S. Shi, C. Lin, J. Yin, and W. Ma (2016) How well do computers solve math word problems? large-scale dataset construction and evaluation. In ACL, Cited by: Math Word Problems.
  • D. Huang, S. Shi, C. Lin, and J. Yin (2017) Learning fine-grained expressions to solve math word problems. In EMNLP, Cited by: Math Word Problems.
  • R. Koncel-Kedziorski, H. Hajishirzi, A. Sabharwal, O. Etzioni, and S. D. Ang (2015) Parsing algebraic word problems into equations. Transactions of the Association for Computational Linguistics 3, pp. 585–597. Cited by: Math Word Problems, Math Word Problems.
  • N. Kushman, L. Zettlemoyer, R. Barzilay, and Y. Artzi (2014) Learning to automatically solve algebra word problems. In ACL, Cited by: Math Word Problems, Math Word Problems.
  • Q. Li, S. Huang, Y. Hong, Y. Chen, Y. N. Wu, and S. Zhu (2020a) Closed loop neural-symbolic learning via integrating neural perception, grammar parsing, and symbolic reasoning. In International Conference on Machine Learning (ICML), Cited by: Neural-Symbolic Learning for NLP, Fixing Mechanism, Solution Space Exploration.
  • Q. Li, S. Huang, Y. Hong, and S. Zhu (2020b) A competence-aware curriculum for visual concepts learning via question answering. In European Conference on Computer Vision, Cited by: Math Word Problems, Neural-Symbolic Learning for NLP.
  • C. Liang, J. Berant, Q. Le, K. D. Forbus, and N. Lao (2016a) Neural symbolic machines: learning semantic parsers on freebase with weak supervision. arXiv preprint arXiv:1611.00020. Cited by: Neural-Symbolic Learning for NLP.
  • C. Liang, J. Berant, Q. V. Le, K. D. Forbus, and N. Lao (2016b) Neural symbolic machines: learning semantic parsers on freebase with weak supervision. In ACL, Cited by: Neural-Symbolic Learning for NLP.
  • C. Liang, M. Norouzi, J. Berant, Q. V. Le, and N. Lao (2018) Memory augmented policy optimization for program synthesis and semantic parsing. In NeurIPS, Cited by: Neural-Symbolic Learning for NLP, Experimental Setup.
  • W. Ling, D. Yogatama, C. Dyer, and P. Blunsom (2017) Program induction by rationale generation: learning to solve and explain algebraic word problems. ArXiv abs/1705.04146. Cited by: Introduction, Math Word Problems.
  • L. Magnani (2009) Abductive cognition: the epistemological and eco-cognitive dimensions of hypothetical reasoning. Vol. 3, Springer Science & Business Media. Cited by: Introduction.
  • A. Mitra and C. Baral (2016) Learning to use formulas to solve simple arithmetic problems. In ACL, Cited by: Math Word Problems.
  • H. Qi, T. Wu, M. Lee, and S. Zhu (2015) A restricted visual turing test for deep scene and event understanding. ArXiv abs/1512.01715. Cited by: Math Word Problems.
  • B. Robaidek, R. Koncel-Kedziorski, and H. Hajishirzi (2018) Data-driven methods for solving algebra word problems. ArXiv abs/1804.10718. Cited by: Math Word Problems, Table 1.
  • S. Roy and D. Roth (2017) Unit dependency graph and its application to arithmetic word problem solving. In AAAI, Cited by: Math Word Problems.
  • S. Shi, Y. Wang, C. Lin, X. Liu, and Y. Rui (2015) Automatically solving number word problems by semantic parsing and reasoning. In EMNLP, Cited by: Math Word Problems.
  • K. Tu, M. Meng, M. Lee, T. E. Choe, and S. Zhu (2014) Joint video and text parsing for understanding events and answering queries. IEEE MultiMedia 21, pp. 42–70. Cited by: Math Word Problems.
  • S. Upadhyay, M. Chang, K. Chang, and W. Yih (2016) Learning from explicit and implicit supervision jointly for algebra word problems. In EMNLP, Cited by: Math Word Problems.
  • L. Wang, Y. Wang, D. Cai, D. Zhang, and X. Liu (2018) Translating math word problem to expression tree. In EMNLP, Cited by: Introduction, Math Word Problems, Table 1.
  • L. Wang, D. Zhang, J. Zhang, X. Xu, L. Gao, B. T. Dai, and H. T. Shen (2019) Template-based math word problem solvers with recursive neural networks. Proceedings of the AAAI Conference on Artificial Intelligence 33 (01), pp. 7144–7151. Cited by: Math Word Problems, Table 1.
  • Y. Wang, X. Liu, and S. Shi (2017) Deep neural solver for math word problems. In EMNLP, Copenhagen, Denmark, pp. 845–854. Cited by: Introduction, Introduction, Math Word Problems, Math Word Problems, Experimental Setup, Table 1.
  • R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8 (3-4), pp. 229–256. Cited by: Introduction.
  • Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, L. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. S. Corrado, M. Hughes, and J. Dean (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. ArXiv abs/1609.08144. Cited by: Experimental Setup.
  • Z. Xie and S. Sun (2019) A goal-driven tree-structured neural model for math word problems. In IJCAI, Cited by: Introduction, Introduction, Math Word Problems, Math Word Problems, Goal-driven Tree-Structured Model, Weakly-Supervised MWPs, Experimental Setup, Experimental Setup, Table 1.
  • W. Yu, Z. Jiang, Y. Dong, and J. Feng (2020) ReClor: a reading comprehension dataset requiring logical reasoning. ArXiv abs/2002.04326. Cited by: Math Word Problems.
  • M. Yu-hui, Z. Ying, C. Guang-zuo, R. Yun, and H. Rong-huai (2010) Frame-based calculus of solving arithmetic multi-step addition and subtraction word problems. 2010 Second International Workshop on Education Technology and Computer Science 2, pp. 476–479. Cited by: Math Word Problems.
  • C. Zhang, F. Gao, B. Jia, Y. Zhu, and S. Zhu (2019) RAVEN: a dataset for relational and analogical visual reasoning. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5312–5322. Cited by: Math Word Problems.
  • J. Zhang, L. Wang, R. K. Lee, Y. Bin, J. Shao, and E. Lim (2020a) Graph-to-tree learning for solving math word problems. ACL 2020. Cited by: Math Word Problems, Table 1.
  • W. Zhang, C. Zhang, Y. Zhu, and S. Zhu (2020b) Machine number sense: a dataset of visual arithmetic problems for abstract and relational reasoning. ArXiv abs/2004.12193. Cited by: Math Word Problems.
  • L. Zhou, S. Dai, and L. Chen (2015) Learn to solve algebra word problems using quadratic programming. In EMNLP, Cited by: Math Word Problems, Math Word Problems.
  • Z. Zhou (2019a) Abductive learning: towards bridging machine learning and logical reasoning. Science China Information Sciences 62, pp. 1–3. Cited by: Introduction.
  • Z. Zhou (2019b) Abductive learning: towards bridging machine learning and logical reasoning. Science China Information Sciences 62, pp. 1–3. Cited by: Neural-Symbolic Learning for NLP.
  • Y. Zhu, T. Gao, L. Fan, S. Huang, M. Edmonds, H. Liu, F. Gao, C. Zhang, S. Qi, Y. N. Wu, J. B. Tenenbaum, and S. Zhu (2020) Dark, beyond deep: a paradigm shift to cognitive ai with humanlike common sense. Engineering. Cited by: Math Word Problems.