1 Introduction
Automatically solving mathematical problems described in natural language has proven very challenging: it requires natural language understanding, mathematical expression extraction, and complex symbolic reasoning. Existing deep learning-related methods mainly frame these problems as a machine translation task. One branch of methods explicitly encodes structural relations and tries to output the answers directly (Saxton et al., 2019; Schlag et al., 2019). These methods have great expressive power but generalize poorly to unseen cases. Another branch learns a mapping from the problem description to a solution program (Wang et al., 2017; Amini et al., 2019), which explicitly encodes domain knowledge. These program-based methods rely heavily on human labeling, which is laborious, time-consuming, and error-prone. Besides, some problems are hard to express in a program format, such as the many varieties of probability problems (e.g., Three letters picked without replacement from idiidauauuiuaiduaiiu. What is prob of sequence iaa?).
Recently, abductive learning (Dai et al., 2019) introduced a discrete logic module into a neural network with an integrated learning procedure. The logic module uses the logical consistency between the perception outputs and the logical background knowledge to optimize the perception module and the logic module jointly. This work demonstrates that it is possible to build a system with both the flexible perception power of neural networks and the generalization power of programmed knowledge.
In this paper, we follow the abductive learning framework and propose ABLSym, a system that integrates Transformer networks with a mathematical symbolic library for automatically solving math problems. ABLSym first runs a consistency check and correction procedure: it generates programs from natural language descriptions and uses a program executor to run them; if a program's output is inconsistent with the answer, it employs a search routine to correct the program. ABLSym then learns from the problem descriptions and the corrected programs. ABLSym repeats these two steps to improve its model. We evaluate ABLSym on the mathematics dataset of Saxton et al. (2019). The results show that ABLSym significantly outperforms the previous state-of-the-art approaches: it achieves a 9.72% relative accuracy improvement on interpolation tasks and a 47.22% relative accuracy improvement on extrapolation tasks.
2 Background
2.1 The Mathematics Dataset
Saxton et al. (2019) introduced a mathematics dataset that contains a variety of math problems, including algebra, arithmetic, numerical comparison, numerical factorization, calculus, measurement, and probability. Each problem is a question-answer pair, where the question is like Let q(m) = m**3 + 2. Let r(c) = -4*c**3 - 9. What is 18*q(f) + 4*r(f)? and the answer is like 2*f**3. Although many answer sequences may have the same mathematical meaning, the evaluation criterion is character-by-character: each question is scored either 0 or 1 according to whether the predicted answer matches the correct answer exactly. The dataset is procedurally generated and consists of 56 modules, and each module provides 2M pre-generated training samples and 10k interpolation samples. Extrapolation samples are also provided as an additional measure of algebraic generalization.
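The character-by-character criterion can be illustrated with a minimal sketch. The function name `exact_match_score` is hypothetical and is not part of the dataset's tooling; the sketch only shows that a mathematically equivalent answer in a different surface form scores 0.

```python
def exact_match_score(predicted: str, target: str) -> int:
    """Return 1 only when the two answer strings are identical character by character."""
    return int(predicted == target)


# Identical strings score 1.
print(exact_match_score("2*f**3", "2*f**3"))  # 1
# Same mathematical value, different surface form: scored as wrong.
print(exact_match_score("f**3*2", "2*f**3"))  # 0
```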
2.2 Sympy
Sympy (Meurer et al., 2017) is a symbolic mathematics library that contains more than 300 mathematical functions. Although many mathematical engines could be used, we adopt Sympy because it makes it convenient to obtain all the appropriate mathematical functions, to exclude non-mathematical functions, and to access the docstrings of mathematical functions directly.
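As a rough illustration of the docstring access mentioned above, the following sketch lists the plain functions exported at the top level of Sympy together with their docstrings, using Python's standard `inspect` module. The helper `collect_docstrings` is hypothetical, it ignores callables that Sympy implements as classes, and it is not necessarily how the operator set in this paper was assembled.

```python
import inspect

import sympy


def collect_docstrings(module=sympy):
    """Map each public plain function of the module to its docstring."""
    docs = {}
    for name, obj in inspect.getmembers(module, inspect.isfunction):
        doc = inspect.getdoc(obj)
        # Skip private helpers and undocumented functions.
        if not name.startswith("_") and doc:
            docs[name] = doc
    return docs


docs = collect_docstrings()
print(len(docs), "documented functions found")
print(docs["diff"].splitlines()[0])
```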
2.3 Related work
Saxton et al. (2019) released a mathematics dataset and used it to analyze the reasoning and generalization ability of popular neural architectures such as recurrent architectures and attention-augmented architectures (i.e., Transformer (Vaswani et al., 2017)). The results show that the learned models did not perform mathematical reasoning well, particularly in the extrapolation setting. Schlag et al. (2019) incorporate the tensor-product representation technique into the Transformer to better support the explicit representation of relational structure. They achieved better results than the vanilla Transformer architecture without introducing any domain knowledge.
A program format is a typical way to represent both discrete domain knowledge and the solution structure of mathematical problems. Amini et al. (2019) released a dataset of math word problems that are densely annotated with programs by crowdsourcing. Based on this dataset, they proposed a sequence-to-program model with automatic problem categorization. Compared with their method, our approach applies to datasets without annotated programs; moreover, we use both the neural network and the discrete symbolic system for prediction.
Abductive learning (Dai et al., 2019) was recently proposed for connecting a perception module with an abductive logical reasoning module using consistency optimization. The perception module generates output, the reasoning module checks and corrects the logical consistency, and the consistency information is used to update the perception module to generate logically more consistent output. This constitutes a forward cycle. Our approach is inspired by the above abductive learning framework, while we are addressing a different domain.
3 ABLSym
In the following subsections, we introduce the program definition, the program correction, and the training procedure.
3.1 Programs
We define programs in a domain-specific language (DSL) rather than an arbitrary Turing-complete language in order to reduce the program search space. Every word in the DSL is called an operator, and all available operators form the operator space. The relationship between adjacent operators is restricted: for example, an argc operator must be followed by a math operator, the number of available variables must be no less than argc, and argc must be a valid number of arguments for that mathematical operator.
3.1.1 Operator Space
The operator space consists of about 400 operators, including mathematical operators, position-aware operators, and several auxiliary operators.
Mathematical Operators: We use Sympy as the program executor. Sympy provides more than 300 functions that are essential for solving math problems (e.g., add, multiply, solve, diff). We treat these functions as our mathematical operators.
Position Operators: Mathematical expressions may appear anywhere in a problem. We tokenize the problem sentence with a simple tokenizer and use positional indexes to identify expressions. The tokenizer splits the sentence on spaces and treats tokens that are not in the ordinary word dictionary as expressions. The ordinary word dictionary consists of non-digit words and excludes common ordinal number words (e.g., first, second, square). In addition, we exclude the single letters a-z because they are often used to represent variables in math problems. After tokenization, consecutive expression tokens are merged into one token. We use pos0, pos1, pos2, … as positional operators to represent the positions of the related expressions.
Auxiliary Operators: Functions in Sympy may accept different numbers of parameters (e.g., the diff function for obtaining derivatives has at least two usages: diff(x**2+x*y, x) and diff(x**2+x*y, x, 2)). We add argc0, argc1, argc2, and argc3 to the operator space in order to specify the number of function parameters explicitly. Some expressions in the question do not conform to the input format of the mathematical operators, and the output formats of some operators do not conform to the answer format, so we also add several format conversion operators and operator wrappers to the operator space.
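The role of the argc operators can be illustrated with the two diff usages quoted above. This is a minimal sketch: `apply_with_argc` is a hypothetical helper written for illustration, not the paper's implementation, while `sympy.diff` and `sympy.symbols` are real Sympy calls.

```python
import sympy

x, y = sympy.symbols("x y")
expr = x**2 + x*y


def apply_with_argc(func, argc, available):
    """Call func on the first argc entries of available (hypothetical helper)."""
    if argc > len(available):
        raise ValueError("not enough arguments available")
    return func(*available[:argc])


# Same function, different arity selected explicitly.
first = apply_with_argc(sympy.diff, 2, [expr, x, 2])   # diff(expr, x)
second = apply_with_argc(sympy.diff, 3, [expr, x, 2])  # diff(expr, x, 2)
print(first)   # 2*x + y
print(second)  # 2
```

Making the arity an explicit token keeps each program unambiguous even though the underlying function accepts a variable number of arguments.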
3.1.2 Program Executor
We build a simple program executor based on Sympy to run programs. During a run, the program's operators are executed sequentially, and intermediate results are saved in the environment through registry variables, which may be used by later operators. If an error is encountered, execution stops and returns none; if execution reaches the end, the execution result is returned.
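A minimal sketch of such an executor follows. It rests on assumptions that the text does not confirm: operators are read left to right in postfix order, a plain stack stands in for the registry variables, and the operator table `MATH_OPS` is a hypothetical three-entry subset. The program and problem are the ones quoted in Section 4.2.

```python
import sympy

# Hypothetical operator table: a tiny subset of Sympy functions.
MATH_OPS = {"denom": sympy.denom, "lcm": sympy.lcm, "diff": sympy.diff}


def run_program(program, tokens):
    """Run a space-separated program; return None on any failure."""
    stack, argc = [], None
    try:
        for op in program.split():
            if op.startswith("pos"):
                # Positional operator: load the expression token at that index.
                stack.append(sympy.sympify(tokens[int(op[3:])].rstrip(".?")))
            elif op.startswith("argc"):
                # Argument-count operator: applies to the next math operator.
                argc = int(op[4:])
            else:
                if argc is None or argc > len(stack):
                    return None
                split = len(stack) - argc
                args, stack = stack[split:], stack[:split]
                stack.append(MATH_OPS[op](*args))
                argc = None
        return stack[-1] if len(stack) == 1 else None
    except Exception:
        # Any execution error stops the run and yields no result.
        return None


tokens = "Calculate the common denominator of 25/13728 and 121/1248.".split()
print(run_program("pos7 argc1 denom pos5 argc1 denom argc2 lcm", tokens))  # 13728
```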
3.1.3 Program Search Procedure
The program search space is too large for random search to find the correct programs. We design an abductive learning framework to search for programs efficiently. The framework performs multiple search iterations. In the first iteration, we use a search-based method as the program generator to produce candidate programs. The program executor then runs the programs, and a consistency checker filters out the programs whose results are inconsistent with the answers. A neural network model is trained to learn the mapping from problems to correct programs. The learned model then serves as a better program generator for the next iteration. In addition, we develop the following techniques to speed up the search process further.
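One pass of the generate, execute, and check cycle can be sketched as follows. The function `search_iteration` and the toy generator and executor are hypothetical stand-ins so that the sketch runs on its own; in the framework described above, the generator would be the search-based method or the learned model, and the executor would be the Sympy-based one.

```python
def search_iteration(problems, generate, execute, max_samples):
    """One search pass: keep the first candidate program whose result matches the answer."""
    found = {}
    for question, answer in problems:
        for program in generate(question, max_samples):
            result = execute(program, question)
            # Consistency check: the executed result must equal the known answer.
            if result is not None and str(result) == answer:
                found[question] = program
                break
    return found


# Toy stand-ins (hypothetical) so the sketch is self-contained.
def toy_generate(question, max_samples):
    return ["add", "mul"][:max_samples]


def toy_execute(program, question):
    a, b = (int(t) for t in question.split() if t.isdigit())
    return {"add": a + b, "mul": a * b}.get(program)


problems = [("combine 3 and 4", "12"), ("combine 2 and 5", "7")]
found = search_iteration(problems, toy_generate, toy_execute, max_samples=2)
print(found)  # {'combine 3 and 4': 'mul', 'combine 2 and 5': 'add'}
```

The (question, program) pairs collected this way are the training data for the next generator, which is what makes later iterations cheaper.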
Warm-up Operator Distribution
Math problems are often strongly related to specific mathematical terms (e.g., in derivative problems, the terms derivative and differentiate often appear). Additionally, almost every mathematical function in Sympy has a docstring, which usually contains related mathematical terms. We can therefore build relationships between problems and mathematical operators. In this paper, we adopt the method of Arora et al. (2016) to calculate the cosine similarity between the problem description and the docstring of each operator, and then normalize the similarities with a softmax to obtain a probability distribution over operators, which is used to generate candidate programs.
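The similarity-then-softmax step can be sketched as below. Note the substitution: plain bag-of-words count vectors are used here in place of the sentence embeddings of Arora et al. (2016), and the docstrings are short invented stand-ins rather than real Sympy docstrings, so this only illustrates the shape of the computation.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words count vector (a stand-in for a sentence embedding)."""
    return Counter(text.lower().split())


def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def operator_distribution(problem, docstrings):
    """Softmax over cosine similarities between the problem and each docstring."""
    names = list(docstrings)
    sims = [cosine(bow(problem), bow(docstrings[n])) for n in names]
    peak = max(sims)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    return {n: e / total for n, e in zip(names, exps)}


# Invented docstrings, for illustration only.
toy_docs = {
    "diff": "differentiate an expression and return the derivative",
    "lcm": "compute the least common multiple",
    "solve": "algebraically solve equations",
}
dist = operator_distribution("find the derivative of x**2", toy_docs)
print(max(dist, key=dist.get))  # diff
```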
Curriculum Search Strategy
The problems in the Mathematics Dataset can be divided into simple problems and compositional problems, depending on whether a problem is itself composed of simpler problems (e.g., a compositional problem: Suppose 2*v + 1873 = 4*x  3*x, x = 2*v  1863. Let u = 65 + 25. Find the common denominator of 1/6 and v/(920)  8/u.). Programs for simple problems can be found relatively easily by searching, but programs for compositional problems cannot. We observe that a compositional problem can be broken down into multiple parts, each of which is similar to a simple problem (e.g., the above problem can be broken down into three parts: Suppose 2*v + 1873 = 4*x  3*x, x = 2*v  1863#Let u = 65 + 25#Find the common denominator of 1/6 and v/(920)  8/u). Therefore, we use the neural network model learned from simple problems to generate candidate programs for each part and assemble them into complete programs. The program executor then executes the programs, and the consistency checker checks the results for correctness.
The whole search process is time-consuming, so we perform the search only on 500k randomly generated problems that meet the qualifying conditions, and use the learned model to generate programs for the rest.
3.2 Neural Models
The neural network model we use is a modified version of the original Transformer (Vaswani et al., 2017), with a shared Transformer encoder and two separate Transformer decoders, one for the answer and one for the program. The encoder maps the problem to a sequence of hidden states. The two decoders take the shared hidden states and autoregressively generate the answer sequence and the program sequence, respectively. During training, the decoders receive the shifted targets, while during inference we feed back the previously generated symbols with the highest probability. We treat the question and the answer as sequences of characters, just like (Vaswani et al., 2017), and treat the program as a sequence of operators. The overall training loss is the weighted sum of the answer decoding loss and the program decoding loss.
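As a hedged illustration of this weighted sum, using symbols that are assumptions rather than the paper's own notation ($\lambda_{a}$ and $\lambda_{p}$ for the two weights, $\mathcal{L}_{a}$ and $\mathcal{L}_{p}$ for the answer and program decoding losses), the objective could be written as:

```latex
\mathcal{L} = \lambda_{a}\,\mathcal{L}_{a} + \lambda_{p}\,\mathcal{L}_{p}
```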
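Table 1: Overall performance on the Mathematics Dataset.

| Model | Weights | Steps | Interpolation acc | Interpolation >95% | Extrapolation acc | Extrapolation >95% |
|---|---|---|---|---|---|---|
| Transformer (Saxton et al.) | 30M | 500k | 76.00% | 13 | 50.00% | 1 |
| TP-Transformer (Schlag et al.) | 49.2M | 700k | 80.67% | 18 | 52.48% | 3 |
| Transformer (ours) | 44.2M | 700k | 76.41% | 13 | 50.48% | 2 |
| TP-Transformer (ours) | 49.2M | 700k | 79.82% | 18 | 51.99% | 3 |
| ABLSym+Transformer (ours) | 54.9M | 700k | 87.85% | 29 | 73.41% | 7 |
| ABLSym+TP-Transformer (ours) | 58.8M | 700k | 88.52% | 33 | 77.26% | 8 |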
4 Experiments
We evaluate our framework on the mathematics dataset (Saxton et al., 2019). We did not evaluate on other mathematical datasets (Kushman et al., 2014; Huang et al., 2016; Upadhyay & Chang, 2016; Wang et al., 2017; Ling et al., 2017; Amini et al., 2019) because these datasets are either limited to narrow, specific fields or require manually annotated programs.
4.1 Settings
During the search, the maximum number of sampled programs for each problem is on the first iteration and on the other iterations. The number of iterations is set to 5.
We extract a character-level vocabulary of 72 symbols and an operator-level vocabulary of 380 symbols, both including START, END, and PADDING symbols.
Our Transformer-like model parameters , , are set to an embedding size of , with attentional heads, and intermediate feed-forward dimension of 2048. The answer decoder is with layers of while the program decoder is with layers of . We train our model via the Adam optimizer (Kingma & Ba, 2014) with a learning rate of , , , . We use a batch size of , with absolute gradient value clipping of . We trained our model on one server with 8 Nvidia V100 GPUs for 12 days. During the search process, the parameter configuration of our program generation model is the same as that of the above model.
At inference time, answers and programs are generated by sequential decoding. If the predicted program is none or fails to run successfully, the answer from the neural model is used as the final result.
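The fallback rule can be written as a short sketch. The function name `final_answer` is hypothetical, and the executor is assumed to return `None` both when no program is predicted and when execution fails.

```python
def final_answer(program_result, neural_answer):
    """Prefer the executed program's result; fall back to the answer decoder."""
    if program_result is None:
        # No program, or the program failed to run: use the neural answer.
        return neural_answer
    return str(program_result)


print(final_answer(13728, "13720"))  # 13728
print(final_answer(None, "13720"))   # 13720
```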
4.2 Experimental Results
Table 1 presents the overall performance on the dataset. Our best model, ABLSym+TP-Transformer, reaches 88.52% accuracy on the interpolation test set and 77.26% on the extrapolation test set, against 80.67% and 52.48% for the previous state-of-the-art TP-Transformer (Schlag et al., 2019). Augmenting the model with programs improves performance substantially, especially when generalizing to areas not previously seen. For a more detailed comparison, Fig. 1 shows the test performance on extrapolation modules.
Table 2 shows the performance of the five iterations of ABLSym together with that of random search. ABLSym is clearly better than random search. In the first iteration, it used about 20% fewer searches per question on average than the random search strategy but found 76% more programs, mainly due to the warm-up strategy and the curriculum search strategy. These strategies allow us to find more programs faster within the maximum search limit. After the first iteration, the learned model served as a better program generator and produced better candidate programs, so we found an additional 8% of the programs at negligible extra search cost. Compared with the second iteration, the number of programs found in the following iterations increased only a little, because most of the programs that can be found had already been found. Still, programs for many problems were not found: the hit ratio after five iterations is 42.3%.
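Table 2: Search performance of ABLSym over five iterations, compared with random search.

| Method | Per-question searches | Hit ratio |
|---|---|---|
| ABLSym (1 itr) | 64.14k | 33.2% |
| ABLSym (2 itrs) | 64.86k | 40.1% |
| ABLSym (3 itrs) | 65.49k | 41.3% |
| ABLSym (4 itrs) | 66.11k | 42.0% |
| ABLSym (5 itrs) | 66.73k | 42.3% |
| Random search | 82.09k | 18.9% |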
ABLSym can find programs for many compositional or complex problems (e.g., for Calculate the common denominator of 25/13728 and 121/1248., the program found is pos7 argc1 denom pos5 argc1 denom argc2 lcm), whereas the random search strategy failed on them.
5 Conclusion
In this work, we demonstrate that integrating discrete systems into neural systems is a feasible way to enhance neural systems, particularly their extrapolation ability. Note that even human beings learn complex knowledge, e.g., mathematics, progressively from well-organized textbooks. Well-designed discrete systems may play the role of textbooks in building complex intelligent systems.
References
 Amini et al. (2019) Amini, A., Gabriel, S., Lin, P., Koncel-Kedziorski, R., Choi, Y., and Hajishirzi, H. MathQA: Towards interpretable math word problem solving with operation-based formalisms. arXiv preprint arXiv:1905.13319, 2019.
 Arora et al. (2016) Arora, S., Liang, Y., and Ma, T. A simple but tough-to-beat baseline for sentence embeddings. 2016.
 Dai et al. (2019) Dai, W.-Z., Xu, Q., Yu, Y., and Zhou, Z.-H. Bridging Machine Learning and Logical Reasoning by Abductive Learning. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32, pp. 2811–2822. Curran Associates, Inc., 2019.
 Huang et al. (2016) Huang, D., Shi, S., Lin, C.-Y., Yin, J., and Ma, W.-Y. How well do computers solve math word problems? Large-scale dataset construction and evaluation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 887–896, 2016.
 Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kushman et al. (2014) Kushman, N., Artzi, Y., Zettlemoyer, L., and Barzilay, R. Learning to automatically solve algebra word problems. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 271–281, 2014.
 Ling et al. (2017) Ling, W., Yogatama, D., Dyer, C., and Blunsom, P. Program induction by rationale generation: Learning to solve and explain algebraic word problems. arXiv preprint arXiv:1705.04146, 2017.
 Meurer et al. (2017) Meurer, A., Smith, C. P., Paprocki, M., Čertík, O., Kirpichev, S. B., Rocklin, M., Kumar, A., Ivanov, S., Moore, J. K., Singh, S., et al. Sympy: symbolic computing in python. PeerJ Computer Science, 3:e103, 2017.
 Saxton et al. (2019) Saxton, D., Grefenstette, E., Hill, F., and Kohli, P. Analysing mathematical reasoning abilities of neural models. arXiv preprint arXiv:1904.01557, 2019.
 Schlag et al. (2019) Schlag, I., Smolensky, P., Fernandez, R., Jojic, N., Schmidhuber, J., and Gao, J. Enhancing the transformer with explicit relational encoding for math problem solving. arXiv preprint arXiv:1910.06611, 2019.
 Upadhyay & Chang (2016) Upadhyay, S. and Chang, M.-W. Annotating derivations: A new evaluation strategy and dataset for algebra word problems. arXiv preprint arXiv:1609.07197, 2016.
 Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008, 2017.

 Wang et al. (2017) Wang, Y., Liu, X., and Shi, S. Deep neural solver for math word problems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 845–854, 2017.