1 Introduction
Reasoning with numbers is an important skill that occurs in various day-to-day scenarios and, not surprisingly, numbers are ubiquitous in textual data. To train AI reasoning systems that can perform simple mathematical reasoning, many tasks have been proposed (Dua et al., 2019b; Ravichander et al., 2019; Koncel-Kedziorski et al., 2016). Despite these efforts, current state-of-the-art AI systems are brittle and fail when problems involving similar mathematical reasoning are posed in a slightly different manner. For instance, presenting a word problem in a different manner as shown in fig. 1, while hardly affecting human performance, is sufficient to confuse state-of-the-art AI systems (footnote 2: the recently released GPT3-Instruct, a finetuned model with 175B parameters, produces inconsistent answers for these questions; see supplementary material: GPT3-Instruct’s Response for more details). This brittleness in reasoning indicates that the models latch on to spurious signals in the specific dataset, resulting in “solving” the dataset while not truly understanding the underlying reasoning skill of simple arithmetic.
Further, we believe that building AI systems that can truly understand and apply simple arithmetic reasoning is a mandatory first step towards successfully tackling complex mathematical reasoning skills Saxton et al. (2019); Hendrycks et al. (2020, 2021).
NumGLUE. To this end, we propose NumGLUE, a multitask benchmark consisting of eight different tasks that at their core test for arithmetic reasoning skills.
For example, as discussed in fig. 1, tasks can involve word problems presented in a slightly different manner or can involve additional reasoning strategies like commonsense reasoning or reading comprehension to be combined with the core skill of simple arithmetic.
Our benchmark consists of four new tasks in addition to four existing ones, with problems spread across the eight tasks.
The motivation behind NumGLUE is similar to GLUE Wang et al. (2018, 2019), a multitask benchmark that aimed at models that demonstrated superior language understanding by learning the underlying linguistic features.
NumGLUE is designed with the goal of progressing towards AI systems that are capable of performing arithmetic reasoning in a general setting; achieving superior performance on our benchmark requires the ability to correctly identify and perform the underlying arithmetic reasoning without relying on task- or dataset-specific signals.
Finally, we hope that NumGLUE will encourage systems that perform robust and general numeric reasoning within language, a first step towards being able to perform more complex mathematical reasoning.
Contributions.

We introduce NumGLUE, a multitask benchmark consisting of eight different tasks, including four new ones, whose solution at its core requires an understanding of simple arithmetic.

We demonstrate that NumGLUE is a challenging benchmark even for state-of-the-art large-scale language models, which obtain poor scores not only in zero- or few-shot settings but also after finetuning. This indicates a fundamental barrier for AI systems, one that needs to be breached before complex mathematical challenges can be successfully tackled.

Finally, we propose a memory-augmented neural model to demonstrate the utility of such a multitask meta-dataset. Our proposed model, when trained on the entirety of NumGLUE, obtains an average improvement of 3.4% on each task as opposed to task-specific training, indicating that joint training leads to beneficial transfer owing to the common theme of arithmetic reasoning.
2 Related Work
Datasets for Numerical Reasoning. Quantitative reasoning has been a challenging problem for a long time. Small question answering datasets were proposed to understand the quantitative aspect of natural language, such as the template-based dataset which solved questions with equations as parameters Kushman et al. (2014), the addition-subtraction dataset Hosseini et al. (2014) and the arithmetic problems dataset Koncel-Kedziorski et al. (2015). The difficulty of questions was increased in subsequent datasets Roy and Roth (2016), Upadhyay et al. (2016). Later, larger datasets were created to facilitate deep learning research Ling et al. (2017); Dua et al. (2019b). Several other maths datasets have been proposed to improve explainability Amini et al. (2019), diversity Miao et al. (2020), scale information in language embeddings Zhang et al. and hardness of math questions Hendrycks et al. (2021). One of the motivations behind creating this benchmark is to test for simple arithmetic reasoning independent of the context or the presentation style of the problem. Further, to the best of our knowledge, our work is the first to consider multiple tasks in the numerical reasoning space.
Multitask Benchmarks. With the increased success of deep learning based models on individual tasks, there has been a significant push both in the NLP community and in the broader AI community towards general-purpose models that excel at multiple tasks. Naturally, various benchmarks and challenges that test for such understanding have been proposed. For instance, the bAbI dataset Weston et al. (2015), GLUE Wang et al. (2019) and the subsequent harder SuperGLUE Wang et al. (2019) were proposed to both evaluate and drive progress in language understanding via shared linguistic knowledge across tasks. McCann et al. (2018) build a multitask dataset via a novel approach: formatting each task as question answering. In the more restricted setting of reading comprehension, Dua et al. (2019a) and Downey and Rumshisky build a meta-dataset that spans multiple domains and reasoning skills.
Multitask Models. With the growing interest in models that go beyond specific datasets, various neural models that can perform multiple tasks have been proposed. When the underlying reasoning is similar (e.g. commonsense reasoning, problem decomposition or linguistic understanding), it has been found that training on multitask datasets yields more robust and accurate models. For instance, the Multitask Question Answering Network McCann et al. (2018), T5 Raffel et al. (2019), GPT3 Brown et al. (2020) and GPT3-Instruct models aim to build general-purpose language models that are capable of transferring linguistic understanding across tasks. A similar approach is taken by Khashabi et al. (2020) in the setting of question answering and Lourie et al. (2021) in the scope of commonsense reasoning. Further, Muppet Aghajanyan et al. (2021) adds an additional step of pre-finetuning between pretraining and finetuning that improves generalization to multiple tasks.
Table 1: An example question from each NumGLUE task, with the number of data points per task.
| Task | Question Setting | Size | Example |
|---|---|---|---|
| Task 1 | Commonsense + Arithmetic | 404 | Question: A man can lift one box in each of his hands. How many boxes can a group of 5 people hold in total? Answer: 10 |
| Task 2 | Domain specific + Arithmetic | 1620 | Question: How many units of are required to react with 2 units of to form 2 units of ? Answer: 2 |
| Task 3 | Commonsense + Quantitative | 807 | Question: A person wants to get shopping done quickly. They know that they can get through the checkout at big store in 5 minutes whereas it can take 20 minutes at small store. The store they go to finish quickly is? (A) big store (B) small store? Answer: big store |
| Task 4 | Fill-in-the-blanks | 1100 | Question: Joan found 70 seashells on the beach. She gave Sam some of her seashells. She has 27 seashells left. She gave _____ seashells to Sam? Answer: 43 |
| Task 5 | RC + Explicit Numerical Reasoning | 54212 | Passage: <>. Question: How many counties were added in 1887? Answer: 2 |
| Task 6 | RC + Implicit Numerical Reasoning | 32724 | Passage: <>. Question: Which player kicked the shortest field goal? Answer: David Akers |
| Task 7 | Quantitative NLI | 9702 | Statement 1: James took a 3-hour bike ride, Statement 2: James took a more than 1-hour bike ride, Options: Entailment or contradiction or neutral?, Answer: Entailment |
| Task 8 | Arithmetic word problems | 1266 | Question: Joe had 50 toy cars. If he gives away 12 cars, how many cars will he have remaining?, Answer: 38 |
3 NumGLUE
As mentioned previously, our NumGLUE benchmark consists of both new and already existing arithmetic reasoning tasks.
We first begin by introducing the novel datasets curated by us before providing a brief overview of existing tasks that are part of NumGLUE.
Finally, in this section, we provide an analysis of the datasets demonstrating that they contain interesting and diverse linguistic and mathematical properties.
NumGLUE Benchmark.
Our proposed NumGLUE benchmark is a collection of eight different tasks that together include roughly 100K questions (the task sizes in Table 1 sum to 101,835).
The tasks may either be self-contained or require additional background knowledge (e.g. commonsense reasoning) to arrive at the final solution; however, all the tasks, at their core, involve arithmetic reasoning.
Table 1 shows an example question from each task along with the total number of data points associated with that task.
It is important to note that the tasks are imbalanced, with only 404 examples for Task 1 and 54,212 questions under Task 5 (Table 1).
While we could have undersampled the questions to create a balanced suite, we retain the imbalanced dataset in order to mimic the real world – for instance, arithmetic word problems are more abundant as opposed to word problems that may require commonsense reasoning in addition to arithmetic reasoning.
Data Partition and Evaluation. We randomly partition the data in each task into training (70%), development (10%) and test (20%) sets.
In the case of reading comprehension tasks (Tasks 5 and 6), we assign all questions corresponding to a passage to the same split; we do this in order to prevent data leakage, which would otherwise allow models to rely on memorization to arrive at the correct answer.
For each task, we report the F1 measure and, as an aggregate measure of performance on the NumGLUE benchmark similar to Dua et al. (2019b), the (unweighted) average of the F1 scores corresponding to each task.
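The passage-level assignment used for the reading comprehension tasks can be sketched as follows. This is a minimal illustration under stated assumptions, not the released preprocessing code: the field name `passage_id`, the fixed seed, and the choice to apply the 70/10/20 proportions to passages rather than to individual questions are all hypothetical.

```python
import random
from collections import defaultdict


def split_by_passage(examples, seed=0, train_frac=0.7, dev_frac=0.1):
    """Split examples roughly 70/10/20 at the passage level.

    Every question sharing a `passage_id` (hypothetical field name) lands in
    the same split, so no passage appears in both training and test data.
    """
    groups = defaultdict(list)
    for ex in examples:
        groups[ex["passage_id"]].append(ex)

    passage_ids = sorted(groups)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(passage_ids)

    n_train = int(train_frac * len(passage_ids))
    n_dev = int(dev_frac * len(passage_ids))
    assignment = {
        "train": passage_ids[:n_train],
        "dev": passage_ids[n_train:n_train + n_dev],
        "test": passage_ids[n_train + n_dev:],  # remainder of the passages
    }
    return {
        name: [ex for pid in pids for ex in groups[pid]]
        for name, pids in assignment.items()
    }


# Toy data: 10 passages with 3 questions each.
toy = [{"passage_id": p, "question": f"q{p}-{i}"} for p in range(10) for i in range(3)]
splits = split_by_passage(toy)
```

Splitting question by question instead would let two questions about the same passage fall on opposite sides of the train/test boundary, which is the leakage the grouping is meant to avoid.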
3.1 Novel Datasets
The novel tasks proposed as part of NumGLUE are a combination of both freshly collected data and intelligent modifications of already existing datasets. The four novel arithmetic reasoning tasks introduced are as follows (footnote 3: we annotate the datasets manually, and we provide the exact flow used to generate the questions of each task in the supplementary materials: Construction of NumGLUE):
Task 1: Commonsense + Arithmetic Reasoning.
Consider the following question: How many faces do 10 dice have? Answering this not only requires simple arithmetic, i.e. multiplying the number of faces in a die by ten, but also requires knowing that a standard die has six faces.
We collect this dataset by first asking the annotator to write down a numerical commonsense fact (e.g. a human has 2 hands, a day has 24 hours etc.) and then frame a question that requires using this numerical fact as part of a simple arithmetic calculation.
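To make the two ingredients of a Task 1 question concrete, the short sketch below separates the commonsense fact from the arithmetic applied to it. The dictionary keys and function names are hypothetical and chosen only for illustration; the two questions are the dice example above and the Task 1 example from Table 1.

```python
# Numerical commonsense facts of the kind an annotator writes down first.
FACTS = {
    "faces_per_die": 6,
    "hands_per_person": 2,
}


def total_faces(num_dice):
    """How many faces do `num_dice` dice have?"""
    return num_dice * FACTS["faces_per_die"]


def boxes_held(num_people, boxes_per_hand=1):
    """How many boxes can a group hold if each hand lifts `boxes_per_hand`?"""
    return num_people * FACTS["hands_per_person"] * boxes_per_hand


print(total_faces(10))  # 60
print(boxes_held(5))    # 10
```

The arithmetic is a single multiplication in both cases; the difficulty lies in the fact that the multiplier never appears in the question text.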
Task 2: Domain Specific + Arithmetic Reasoning.
How many units of hydrogen are required to produce 10 units of water?
This question, similar to the previously introduced task of arithmetic reasoning questions, requires additional domain-specific knowledge: specifically, that each unit of water contains two units of hydrogen.
We curate a dataset of such questions that require both domain-specific knowledge and arithmetic reasoning, motivated by the finding that QA systems perform poorly on the ARC dataset Clark et al. (2018) consisting of grade-school level science questions.
Specifically, the dataset collected by us requires understanding of a small set of chemistry (conservation of mass in chemical reactions) and physics principles ().
Task 3: Commonsense + Quantitative Comparison.
A golf ball weighs 40g and a baseball weighs 150g. Which has a higher gravitational force?
Answering this question requires both knowing that mass is directly proportional to gravitational force and a numerical comparison via subtraction.
We collect such quantitative comparison questions by using the QuaRel dataset Tafjord et al. (2019) containing questions from diverse fields such as physics and economics as the starting point.
The annotator chooses a subset of these questions that involve numerically comparable quantities (for instance, in this example, mass of the objects involved) to create the required task of quantitative comparison questions.
Task 4: Fill-in-the-blanks Format. Unlike the previously proposed tasks that require external information (e.g. commonsense knowledge) in addition to simple arithmetic reasoning, this task is self-contained but a stylistic variant of existing math word problems. We source word problems from the Arithmetic Word Problem repository Roy and Roth (2016, 2017, 2018) and convert them into the fill-in-the-blanks format. For an example of such a conversion, refer to fig. 1.
3.2 Existing Datasets
We now review existing datasets while discussing any modifications made when including them in NumGLUE.
In general, for all the datasets included, we perform a filtering step to clean and control for the quality of the data points being included.
This step includes: a) discarding questions that do not have answer annotations, b) eliminating questions with high lexical overlap with the remainder of the dataset, and c) fixing any type mismatches present in the data (e.g. “7.0 students” → “7 students”).
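The three filtering steps can be sketched as below. This is an illustrative approximation only: the text does not specify how lexical overlap is measured, so the Jaccard similarity over lowercased whitespace tokens and the `max_overlap=0.8` threshold are assumptions, as are the `question` and `answer` field names.

```python
import re


def normalize_numbers(text):
    """Rewrite integer-valued decimals such as '7.0' as '7'."""
    return re.sub(r"\b(\d+)\.0+\b", r"\1", text)


def jaccard(a, b):
    """Token-set overlap between two strings (assumed overlap measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def filter_examples(examples, max_overlap=0.8):
    kept = []
    for ex in examples:
        # (a) discard questions without an answer annotation
        if ex.get("answer") in (None, ""):
            continue
        # (c) fix type mismatches such as "7.0 students"
        question = normalize_numbers(ex["question"])
        answer = normalize_numbers(str(ex["answer"]))
        # (b) drop questions that overlap heavily with one already kept
        if any(jaccard(question, k["question"]) > max_overlap for k in kept):
            continue
        kept.append({**ex, "question": question, "answer": answer})
    return kept


raw = [
    {"question": "There are 7.0 students in a class. How many remain if 2 leave?", "answer": "5.0"},
    {"question": "there are 7.0 students in a class. how many remain if 2 leave?", "answer": "5.0"},
    {"question": "How many faces do 10 dice have?", "answer": None},
    {"question": "How many hours are in 2 days?", "answer": "48"},
]
clean = filter_examples(raw)
```

Comparing each question only against those already kept makes the result depend on input order; a symmetric clustering step would avoid that at higher cost.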
Task 5: Reading Comprehension (RC) + Explicit Numerical Reasoning.
We select a subset from the DROP Dua et al. (2019b) dataset to create this task.
Specifically, the selected questions involve reading comprehension and numerical reasoning but importantly, the required answer is also a number.
Task 6: Reading Comprehension (RC) + Implicit Numerical Reasoning.
Consider the following question based on a relevant passage – Which state has the highest income tax rate?
Here, while the final answer is a name, arriving at it requires performing a comparison (i.e. subtraction). We classify such questions in the DROP dataset as a separate task in NumGLUE.
Task 7: Quantitative NLI.
EQUATE Ravichander et al. (2019) introduces quantitative NLI questions that require simple arithmetic calculations to be performed in order to accurately classify the relationship between the provided premise and the hypothesis. As noted in fig. 1, many word problems can also be easily converted to this format, which makes it a diverse and interesting task for evaluating the arithmetic reasoning skills of AI systems.
Task 8: Arithmetic Word Problems.
Finally, we arrive at one of the earliest and most extensively studied classes of arithmetic reasoning problems, i.e. word problems. The specific dataset included as part of our NumGLUE benchmark is a combination of multiple datasets proposed by Koncel-Kedziorski et al. (2016), Koncel-Kedziorski et al. (2015) and Kushman et al. (2014). Further, to ensure that the benchmark as a whole is diverse, we eliminate questions that have a high sentence similarity with questions from the fill-in-the-blanks task.
3.3 Data Quality Analysis
In order to ensure a high-quality test set, three independent annotators evaluate each question in the test set across all tasks. A tiny portion of the data, marked as invalid or with disagreement between the annotators, was excluded, resulting in a verified, high-quality NumGLUE evaluation suite. We also perform a variety of analyses and find that the novel question tasks we created (tasks 1-4) have higher quality than the existing question tasks, since they have a higher average vocabulary (number of unique words per number of samples), a higher number of unique nouns, verbs and other POS tags, and less semantic textual similarity among each other (indicating lower repetition). A detailed analysis can be found in the supplementary material: Data Quality Analysis of NumGLUE.
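Table 2: F1 scores of the baselines on the NumGLUE test set.
| Learning | Baseline category | Baseline name | Task 1 | Task 2 | Task 3 | Task 4 | Task 5 | Task 6 | Task 7 | Task 8 | NumGLUE Score |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Heuristic | Task-specific | Random | 0 | 0.3 | 46.9 | 0 | 0.5 | 3.4 | 33 | 0.4 | 10.6 |
| Heuristic | Task-specific | Majority | 1.2 | 13.9 | 50 | 0.5 | 7.4 | 3.8 | 36.5 | 1.2 | 14.3 |
| Zero-shot | - | GPT3 | 0 | 1 | 11 | 2 | 0 | 17 | 6 | 2 | 4.9 |
| Zero-shot | - | GPT3-Instruct | 2 | 1 | 7 | 3 | 3 | 29 | 17 | 3 | 8.1 |
| Few-shot | Task-specific | GPT3 | 44 | 42 | 46 | 40 | 10 | 42 | 35 | 40 | 37.4 |
| Few-shot | Task-specific | GPT3-Instruct | 40 | 39 | 51 | 33 | 13 | 43 | 35 | 33 | 35.9 |
| Few-shot | Multitask | GPT3 | 0 | 3 | 27 | 1 | 7 | 28 | 30 | 4 | 12.5 |
| Few-shot | Multitask | GPT3-Instruct | 1 | 2 | 37 | 2 | 6 | 35 | 31 | 7 | 15.1 |
| Finetuning | Multitask | GPT3-13B | 21.5 | 40.7 | 71.2 | 11.1 | 6.3 | 48.2 | 48.0 | 14.2 | 32.7 |
| Finetuning | Multitask (Q-only) | ExNumNet | 1.2 | 13.2 | 25.1 | 0.5 | 6.1 | 25.1 | 32.8 | 2.4 | 13.3 |
| Finetuning | Multitask (C-only) | ExNumNet | 1.2 | 14.2 | 22.8 | 19.1 | 0.6 | 3 | 0 | 9.5 | 8.8 |
| Finetuning | Single-task | ExNumNet | 0 | 37.8 | 50.8 | 22.2 | 66.6 | 71.6 | 85.9 | 12.2 | 43.4 |
| Finetuning | Multitask | ExNumNet | 0 | 37.5 | 58 | 31.4 | 68.2 | 70.2 | 85.7 | 23.2 | 46.8 |
| Finetuning | Multitask + IR | ExNumNet | 5.6 | 37.5 | 46.6 | 36.4 | 68.6 | 69.6 | 85.9 | 22.4 | 46.6 |
| Finetuning | Multitask + CIR | ExNumNet | 7.4 | 38.8 | 58 | 36.8 | 69.2 | 70.8 | 85.8 | 23.6 | 48.8 |
| Finetuning | Multitask + OS | ExNumNet | 7.4 | 38.8 | 47.8 | 35.9 | 44.3 | 53.7 | 85.4 | 22.4 | 42.0 |
| - | - | Human | 94.4 | 94.5 | 97.8 | 95 | 94.7 | 96.1 | 96.5 | 92.8 | 95.2 |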
4 Experiments
In this section, we establish multiple baselines on our benchmark and discuss their performance.
4.1 Baselines
We evaluate several baselines on our benchmark: (i) Heuristic, (ii) Zero-shot, (iii) Few-shot, (iv) Finetuning and (v) Human.
We use two kinds of model architectures: (i) Neuro-symbolic, a novel memory-augmented architecture that extends NumNet+v2 Ran et al. (2019), and (ii) End-to-end, GPT3 Brown et al. (2020).
Architectures. In the multitask setting where the same model is trained on all the NumGLUE tasks, we use Reading Comprehension (RC) as the common format, converting each task to RC format via a set of hand-coded rules (footnote 4: more details in the supplementary material: ExNumNet).
In addition to being capable of faithfully representing all the constituent tasks, the RC format also allows us to inject additional context in the IR setting without affecting the rest of the pipeline (footnote 5: henceforth we call our extension to NumNet+v2 ExNumNet).
On the other hand, GPT3, being a generative model, does not require such modifications. Importantly, note that both models receive the exact same information in the multitask experiments.
Heuristic Baselines with Task Oracle.
For this baseline, we assume a task oracle that knows the task a particular question belongs to (in a multitask setting); we use this to make our heuristic baselines more competitive.
The first heuristic baseline is random: we randomly select one of the options in case the question has multiple options (tasks 3 and 7), a number between 0 and 100 for questions having a numerical answer, and a random entity present in the passage for questions having a text segment from the passage as the answer.
In the majority baseline, we select the most frequent answer for each task, such as "Entailment" for NLI questions, and similarly the most frequent number for questions having a numerical answer and the major entity present in the passage for questions having a span-based answer.
As the task information is known, we include these baselines under task-specific baselines when discussing results.
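The two heuristic baselines can be sketched as follows. The example schema (`options`, `answer_type`, `passage_entities`, `answer`) is hypothetical, and the majority baseline is shown only in its simplest form, i.e. the most frequent gold answer within one task; choosing the "major entity" of a passage for span answers is not covered.

```python
import random
from collections import Counter


def random_baseline(example, rng):
    """Random guess that respects the answer format of the example."""
    if example.get("options"):                      # multiple-choice tasks (3 and 7)
        return rng.choice(example["options"])
    if example["answer_type"] == "number":          # numeric answers
        return str(rng.randint(0, 100))             # inclusive on both ends
    return rng.choice(example["passage_entities"])  # span answers


def majority_answer(train_examples):
    """Most frequent gold answer among one task's training examples."""
    return Counter(ex["answer"] for ex in train_examples).most_common(1)[0][0]


rng = random.Random(0)
nli = {"options": ["Entailment", "Contradiction", "Neutral"], "answer_type": "label"}
word_problem = {"answer_type": "number"}
nli_train = [{"answer": "Entailment"}, {"answer": "Neutral"}, {"answer": "Entailment"}]
```

The task oracle corresponds to the caller knowing which branch applies; without it, the random baseline could not even pick a valid answer format.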
Zero-shot and Few-shot Baselines.
We use GPT3 Brown et al. (2020) and the more recent GPT3-Instruct (footnote 6: newly released by OpenAI as part of the GPT3 finetuned series).
We have two types of few-shot baselines: (i) task-specific and (ii) multitask. In the task-specific few-shot baseline, instances of the same task are used as in-context examples Brown et al. (2020), whereas in the multitask few-shot baseline, instances from all tasks are used to condition the model.
Multitask few-shot is naturally a harder setting as it is task-agnostic. We use the default parameters in GPT3 and GPT3-Instruct. In the few-shot setting, we feed as many examples as fit within the token limit. For the few-shot experiments, we randomly select examples and average the results over 5 runs.
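One way to pack as many randomly chosen in-context examples as fit into a token budget is sketched below. Several details are assumptions made for illustration: token counts are approximated by whitespace splitting rather than the model's real tokenizer, the `Question:`/`Answer:` prompt template is hypothetical, and packing stops at the first example that does not fit.

```python
import random


def count_tokens(text):
    # Assumption: whitespace tokens as a rough stand-in for model tokens.
    return len(text.split())


def build_few_shot_prompt(pool, query, max_tokens, rng):
    """Prepend randomly ordered solved examples until the budget is used up."""
    candidates = list(pool)
    rng.shuffle(candidates)

    query_block = f"Question: {query}\nAnswer:"
    budget = max_tokens - count_tokens(query_block)

    shots = []
    for ex in candidates:
        block = f"Question: {ex['question']}\nAnswer: {ex['answer']}"
        cost = count_tokens(block)
        if cost > budget:
            break
        shots.append(block)
        budget -= cost
    return "\n\n".join(shots + [query_block])


pool = [{"question": "a b c", "answer": "1"} for _ in range(5)]
prompt = build_few_shot_prompt(pool, "x y", max_tokens=20, rng=random.Random(0))
```

For the task-specific setting `pool` would hold examples of a single task, and for the multitask setting examples drawn from all eight; repeating the call with different seeds gives the runs that are averaged.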
Finetuning Baselines.
We first consider variations of the finetuning baselines in the context of our neuro-symbolic model, ExNumNet.
We use it as a bias-checking baseline, to ensure that solving the benchmark correctly requires considering all of the information presented to the model.
To this end, we evaluate the performance of our model when finetuned only on the question (Q-only) or only on the context (C-only).
Next, we present task-specific and multitask baselines where ExNumNet is finetuned on individual tasks and on the entire NumGLUE benchmark respectively.
With the goal of addressing the data imbalance across the tasks, we include an oversampling baseline that oversamples data from tasks with limited data so as to ensure that the model sees the same number of examples from each constituent task.
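A minimal sketch of such an oversampling scheme is given below. It assumes that smaller tasks are padded by sampling with replacement until they match the largest task; the text only states that every task ends up contributing the same number of examples, so the exact procedure is an assumption.

```python
import random


def oversample(task_to_examples, rng):
    """Pad every task with resampled examples up to the largest task's size."""
    target = max(len(examples) for examples in task_to_examples.values())
    balanced = {}
    for task, examples in task_to_examples.items():
        extra = [rng.choice(examples) for _ in range(target - len(examples))]
        balanced[task] = list(examples) + extra  # originals first, then duplicates
    return balanced


toy_tasks = {"task1": ["a", "b"], "task5": ["p", "q", "r", "s", "t", "u"]}
balanced = oversample(toy_tasks, random.Random(0))
```

With the sizes in Table 1, this would repeat each Task 1 question more than a hundred times on average, which may help explain why oversampling trades accuracy on the large tasks for gains on the small ones.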
In addition, we propose a new architectural modification to ExNumNet. Noting that our baseline model ExNumNet does not take into account external knowledge, we create a new enhanced architecture in the form of a memory-augmented model that performs Information Retrieval (IR) Khot et al. (2019) over a knowledge base we create, MATH KB, to identify the needed knowledge. This is inspired by the observation that formula books and mathematical knowledge make the task easier for humans when solving math questions of various types. We then use this knowledge in the ExNumNet setting. Figure 3 illustrates our approach, which leverages our newly created knowledge base MATH KB. The Conditional IR (CIR) model differs from the regular IR model in that IR is performed only for questions of tasks 1, 2 and 4, since they require external knowledge to be answered. More details about the model and the IR process can be found in the supplementary material: Proposed Memory-Augmented Model (A.5 and A.6).
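The difference between regular and conditional IR can be illustrated with the sketch below. It is not the ExNumNet retrieval pipeline: the retriever here is a simple word-overlap ranking, the two knowledge-base entries are made up for the example, and the `task`, `question` and `context` field names are hypothetical. Only the gating on tasks 1, 2 and 4 comes from the text.

```python
IR_TASKS = {1, 2, 4}  # tasks for which conditional IR performs retrieval


def retrieve(question, knowledge_base, top_k=1):
    """Rank facts by word overlap with the question (assumed toy retriever)."""
    q_tokens = set(question.lower().split())
    ranked = sorted(
        knowledge_base,
        key=lambda fact: len(q_tokens & set(fact.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def build_input(example, knowledge_base, conditional=True):
    """Prepend retrieved facts to the passage in the common RC format."""
    context = example.get("context", "")
    if not conditional or example["task"] in IR_TASKS:
        facts = retrieve(example["question"], knowledge_base)
        context = " ".join(facts + [context]).strip()
    return {"passage": context, "question": example["question"]}


kb = ["a standard die has six faces", "a day has 24 hours"]
dice = {"task": 1, "question": "How many faces do 10 dice have?"}
golf = {"task": 3, "question": "Which has a higher gravitational force?",
        "context": "A golf ball weighs 40g and a baseball weighs 150g."}
```

Calling `build_input(golf, kb, conditional=False)` mimics regular IR, where a Task 3 question also receives a retrieved fact it does not need.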
Finally, we discuss finetuning baselines in the context of end-to-end models, specifically GPT3.
We finetune the GPT3-13B model (for which the finetuning capability has been recently provided by OpenAI; footnote 7: https://beta.openai.com/docs/guides/finetuning) in the multitask setting, i.e. the desired setting of the NumGLUE benchmark.
Human Baseline.
The human baseline was calculated on 100 test set samples of each task (81 for Task 1) by averaging the scores of four annotators.
5 Results and Discussion
Table 2 shows the performance of various baseline models on the test set of our benchmark.
Note that the performance of all baseline models is significantly lower than the human baseline (Figure 2).
We now discuss various insights based on these results.
Does the benchmark contain bias that a model can exploit?
A challenging dataset requires the model to ideally consider all the information provided to it before arriving at an answer.
To ensure that this is indeed the case, we perform ablations where only one portion of the input is provided i.e. either the question or the context.
Both these “bias-checking” baselines perform poorly even in the task-specific setting, indicating that both the benchmark and its constituent tasks are challenging.
Which Tasks are Hard to Solve?
Our results show that task 1 which requires numerical commonsense knowledge, is the hardest task to solve.
Similarly, tasks 2, 4 and 8 appear to be comparatively harder than the rest.
One pattern among these tasks is that all of them expect the answer to be numeric.
A numeric answer requires accurate calculation.
So, models might have difficulty in learning the task directly from data.
This hypothesis is also supported by the slight drop in human performance on these tasks.
On the other hand, task 7 has the best performance among all.
Further, we see that performance on task 6 is slightly better than task 5 – although both tasks are sourced from the same dataset, we observe that models answer span based questions better as compared to numeric answers.
Relatively higher performance for task 3 suggests that models find it easier to answer in an MCQ setting.
Does IR Help?
Results show that knowledge helps in improving performance on tasks 1, 2 and 4, where indeed external knowledge like commonsense or domain-specific knowledge is needed in addition to arithmetic reasoning to arrive at the correct answer.
However, task 3 is an exception to this trend and in fact registers a drop in the score when provided with (unnecessary) additional information; we find that this shortcoming is fixed when using conditional information retrieval (CIR) which in fact leads to the strongest baseline presented in this work.
Does Oversampling help overcome data imbalance across tasks?
Even though oversampling results in higher performance on certain tasks (in comparison with the multitask baseline), specifically the ones with smaller training data, it results in a significant drop in performance at the other extreme, i.e. tasks with bigger training data. Also, it never performs better than the Conditional IR module in the multitask setting.
5.1 Error Analysis
We now present an analysis of the errors made by our baselines to indicate potential avenues for future research.
We analyze errors associated with 50 samples from each of the 8 tasks and find that models mainly make four categories of errors: (1) producing invalid output (e.g. answering with text where the answer is supposed to be a number, or answering with text different from the classes allowed in a classification problem), (2) copying a number from the question instead of calculating the answer, (3) incorrect calculation, which can be due to multiple reasons including (i) using an incorrect operation, e.g. subtraction in place of addition, (ii) incorrect parsing of numbers or (iii) incorrect knowledge of numerical commonsense facts, and (4) producing redundant text after producing the correct answer.
Based on the error distribution in Table 3, we observe that the majority of errors come from incorrect calculation.
Further, GPT3 is better than ExNumNet at producing valid outputs, but it produces more redundant text.
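For questions with a numeric answer, the four categories can be approximated by a simple rule-based labeller such as the one below. This is a heuristic sketch and not the procedure behind Table 3: the function name and the ordering of the checks are assumptions, and sub-causes of an incorrect calculation (wrong operation, misparsed number, wrong commonsense fact) cannot be told apart this way.

```python
import re

NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def categorize_error(question, prediction, gold):
    """Heuristically label a prediction for a question with a numeric gold answer."""
    pred = prediction.strip()
    numbers = NUMBER.findall(pred)
    if not numbers:
        return "invalid output"            # text where a number was expected
    first = float(numbers[0])
    if first == float(gold):
        # Right number; anything beyond the bare number counts as redundant text.
        return "correct" if pred == numbers[0] else "redundant text"
    if first in {float(n) for n in NUMBER.findall(question)}:
        return "copy number"               # number lifted from the question
    return "incorrect calculation"


q = "Joe had 50 toy cars. If he gives away 12 cars, how many cars will he have remaining?"
labels = [categorize_error(q, p, "38") for p in
          ["thirty eight", "50", "62", "38 cars remain after giving some away", "38"]]
```

On the Task 8 example from Table 1, the answer `62` (addition instead of subtraction) is labelled as an incorrect calculation, while `50` is recognised as copied from the question.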
Future Directions: Bigger model, more data or ?
Table 2 shows that finetuned GPT3-13B outperforms other baselines on tasks 1, 2 and 3.
Recall that these tasks require external knowledge and perhaps this is the reason why GPT3, already pretrained on a diverse web-scale text corpus, has an edge over other baselines on these tasks.
In the case of the smaller ExNumNet, it is interesting that multitask baselines are higher than the single-task baselines by 3.4% on average and that information retrieval helps in tasks that require external knowledge. Also notice that GPT3 is better on smaller datasets and ExNumNet is better on larger datasets. This may indicate that GPT3 is a better few-shot learner but not necessarily a better many-shot learner.
This non-overlapping performance of GPT3 and ExNumNet, end-to-end and neuro-symbolic models respectively, indicates that a potential future direction for research is to combine the best of both models.
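The 3.4-point figure can be reproduced from the single-task and multitask ExNumNet rows of Table 2 using the unweighted average that defines the NumGLUE score. The snippet below only re-does that arithmetic; the variable and function names are illustrative.

```python
# Per-task F1 for ExNumNet, Tasks 1-8, copied from Table 2.
single_task = [0, 37.8, 50.8, 22.2, 66.6, 71.6, 85.9, 12.2]
multi_task = [0, 37.5, 58, 31.4, 68.2, 70.2, 85.7, 23.2]


def macro_average(scores):
    """Unweighted mean over tasks, as used for the NumGLUE score."""
    return sum(scores) / len(scores)


gains = [m - s for m, s in zip(multi_task, single_task)]
mean_gain = macro_average(gains)
improved = [task for task, g in enumerate(gains, start=1) if g > 0]

print(round(macro_average(single_task), 1))  # 43.4
print(round(macro_average(multi_task), 1))   # 46.8
print(round(mean_gain, 1))                   # 3.4
print(improved)                              # [3, 4, 5, 8]
```

The per-task differences show that the average gain comes mostly from Tasks 3, 4 and 8, whereas Tasks 2, 6 and 7 change by less than 1.5 points in the other direction.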
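Table 3: Error distribution of ExNumNet and GPT3.
| Error | ExNumNet | GPT3 |
|---|---|---|
| Invalid output | 16% | 7% |
| Copy number | 5% | 3% |
| Incorrect calculation | 71% | 56% |
| Redundant text | 8% | 34% |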
6 Conclusion
We propose NumGLUE, a multitask benchmark to test for arithmetic understanding. Our benchmark consists of eight tasks including four new ones. While some of the tasks require external knowledge like commonsense or domain-specific information in addition to arithmetic reasoning, some are self-contained, e.g. arithmetic word problems. Further, we demonstrate that our benchmark is far from being solved, with state-of-the-art large-scale models achieving considerably lower performance than humans. This indicates that current AI systems are incapable of performing simple arithmetic reasoning in a general setting, which is a fundamental hurdle towards AI systems that understand complex mathematical concepts like differential equations or combinatorics. We also present various baselines including a novel architecture (memory-augmented ExNumNet) that demonstrate the advantages of various modeling choices (e.g. end-to-end vs neuro-symbolic models). Specifically, we show that training in the multitask setting leads to meaningful sharing of knowledge across tasks, as evidenced by an average gain of 3.4% on tasks compared to task-specific modeling. Finally, we hope that our benchmark not only leads to AI systems that are capable of performing simple arithmetic reasoning in a fairly general setting but also results in progress towards more complex mathematical reasoning capability.
Acknowledgements
We thank OpenAI for providing academic access to the GPT3 API, the Aristo team at AI2 for helpful input, the Beaker team for their support with experiments and the anonymous reviewers for their insightful feedback. The support of the DARPA SAIL-ON and DARPA CHESS programs is gratefully acknowledged.
Ethical Considerations
We have verified that all licenses of source datasets used in this paper allow for their use, modification, and redistribution in a research context. The dataset will be distributed in a manner similar to SuperGLUE Wang et al. (2019) i.e. give full credit assignment to the original data and task creators.
References
 Muppet: massive multitask representations with prefinetuning. arXiv preprint arXiv:2101.11038. Cited by: §2.
 MathQA: towards interpretable math word problem solving with operationbased formalisms. arXiv preprint arXiv:1905.13319. Cited by: §2.
 Realtime visual feedback for educative benchmark creation: a humanandmetricintheloop workflow. Cited by: §A.4.
 Language models are fewshot learners. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 1877–1901. Cited by: §2, §4.1.
 Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457. Cited by: §3.1.
 [6] Getting closer to ai complete question answering: a set of prerequisite real tasks. Cited by: §2.
 ORB: an open reading benchmark for comprehensive evaluation of machine reading comprehension. arXiv preprint arXiv:1912.12598. Cited by: §2.
 DROP: a reading comprehension benchmark requiring discrete reasoning over paragraphs. arXiv preprint arXiv:1903.00161. Cited by: §1, §2, §3.2, §3.
 Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324. Cited by: §A.4.
 Measuring massive multitask language understanding. In International Conference on Learning Representations, Cited by: §1.
 Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874. Cited by: §1, §2.

Learning to solve arithmetic word problems with verb categorization.
In
In Conference on Empirical Methods in Natural Language Processing (EMNLP
, Cited by: §2.  UnifiedQA: crossing format boundaries with a single qa system. arXiv preprint arXiv:2005.00700. Cited by: §2.
 What’s missing: a knowledge gap guided approach for multihop question answering. arXiv preprint arXiv:1909.09253. Cited by: §4.1.
 Parsing algebraic word problems into equations. Transactions of the Association for Computational Linguistics 3, pp. 585–597. Cited by: §2, §3.2.
 MAWPS: a math word problem repository. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1152–1157. Cited by: §1, §3.2.
 Learning to automatically solve algebra word problems. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 271–281. Cited by: §2, §3.2.
 Program induction by rationale generation: learning to solve and explain algebraic word problems. arXiv preprint arXiv:1705.04146. Cited by: §2.

UNICORN on rainbow: a universal commonsense reasoning model on a new multitask benchmark.
In
Proceedings of the AAAI Conference on Artificial Intelligence
, Vol. 35, pp. 13480–13488. Cited by: §2.  The natural language decathlon: multitask learning as question answering. arXiv preprint arXiv:1806.08730. Cited by: §2.
 A diverse corpus for evaluating and developing English math word problem solvers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 975–984. Cited by: §2.

 Our evaluation metric needs an update to encourage generalization. arXiv preprint arXiv:2007.06898. Cited by: §A.4.
 Dqi: measuring data quality in nlp. arXiv preprint arXiv:2005.00816. Cited by: §A.4.
 Do we need to create big datasets to learn a task?. In Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing, Online, pp. 169–173. Cited by: §A.4.

 Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683. Cited by: §2.
 NumNet: machine reading comprehension with numerical reasoning. arXiv preprint arXiv:1910.06701. Cited by: §4.1.
 EQUATE: a benchmark evaluation framework for quantitative reasoning in natural language inference. arXiv preprint arXiv:1901.03735. Cited by: §1, §3.2.
 Solving general arithmetic word problems. arXiv preprint arXiv:1608.01413. Cited by: §2, §3.1.
 Unit dependency graph and its application to arithmetic word problem solving. In Thirty-First AAAI Conference on Artificial Intelligence, Cited by: §3.1.
 Mapping to declarative knowledge for word problem solving. Transactions of the Association for Computational Linguistics 6, pp. 159–172. Cited by: §3.1.
 Analysing mathematical reasoning abilities of neural models. arXiv preprint arXiv:1904.01557. Cited by: §1.
 Dataset cartography: mapping and diagnosing datasets with training dynamics. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 9275–9293. Cited by: §A.4.
 Quarel: a dataset and models for answering questions about qualitative relationships. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 7063–7071. Cited by: §3.1.
 Learning from explicit and implicit supervision jointly for algebra word problems. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 297–306. Cited by: §2.
 Superglue: a stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems, pp. 3261–3275. Cited by: §1, §2, Ethical Considerations.
 Glue: a multitask benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461. Cited by: §1, Towards Question Format Independent Numerical Reasoning: A Set of Prerequisite Tasks.
 Towards AI-complete question answering: a set of prerequisite toy tasks. arXiv preprint arXiv:1502.05698. Cited by: §2.
 Do language embeddings capture scales?. Cited by: §2.
Appendix A Supplemental Material
A.1 NumGLUE vs Other Datasets
As Figure 4 shows, we select each task from one of the clusters of numerical reasoning datasets (except the multimodal reasoning cluster, since we wanted to limit our dataset to text only).
A.2 Construction of NumGLUE
A.3 GPT3Instruct’s Response
A.4 Data Quality Analysis of NumGLUE
In this section, we discuss various linguistic and statistical properties of our benchmark that we believe underlie the quality, diversity and challenging nature Gururangan et al. (2018); Mishra et al. (2020b); Mishra and Sachdeva (2020); Swayamdipta et al. (2020); Mishra et al. (2020a); Arunkumar et al. (2020) of the proposed NumGLUE benchmark.
Vocabulary Size. First, we calculate the vocabulary size of each task as the number of unique words across all of its questions. Since our dataset is unbalanced across tasks, we compute the average vocabulary size of a task by dividing its vocabulary size by the number of samples in that task.
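The computation described above can be sketched as follows. This is a minimal illustration rather than the authors' script: the function name `average_vocabulary_size`, the lowercase alphanumeric tokenizer and the toy questions are all assumptions made for the example.

```python
import re


def average_vocabulary_size(questions_by_task):
    """Return, per task, (number of unique words) / (number of questions)."""
    averages = {}
    for task, questions in questions_by_task.items():
        vocabulary = set()
        for question in questions:
            # Assumed tokenizer: lowercase runs of letters and digits.
            vocabulary.update(re.findall(r"[a-z0-9]+", question.lower()))
        averages[task] = len(vocabulary) / len(questions) if questions else 0.0
    return averages


# Toy data (hypothetical), only to show the shape of the input.
toy = {
    "task_a": ["Tom has 3 apples.", "Tom has 5 apples."],
    "task_b": ["A die has 6 faces.", "How many legs does a cow have?"],
}
result = average_vocabulary_size(toy)
print(result)  # {'task_a': 2.5, 'task_b': 5.5}
```

In the toy input, the two near-identical questions of `task_a` share most of their words, so that task scores lower than `task_b`, which is the sense in which a lower average vocabulary indicates repetitiveness.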
Which Data has Higher Average Vocabulary? As illustrated in Figure 11(a), most of the tasks belonging to the novel dataset category have a relatively higher average vocabulary size. This implies that questions in those tasks are less repetitive.
To better understand Figure 11(a), we extend our vocabulary analysis to individual parts of speech; Figure 11(b) summarises the results. Most of the novel datasets have a higher average number of nouns, verbs and adjectives, implying a greater variety of entities, actions and attributes. This further suggests that the datasets belonging to the novel category are more diverse.
Sentence Similarity Analysis
We extend our analysis to reinforce the inference drawn from the vocabulary analysis: we compute the Semantic Textual Similarity (STS) of each sentence with every other sentence.
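The pairwise comparison can be sketched as below. The text does not say which STS model was used, so this example substitutes a plain bag-of-words cosine similarity as a stand-in; the helper names (`bag_of_words`, `cosine`, `similarity_matrix`, `mean_pairwise_similarity`) are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(sentence):
    """Count lowercase alphanumeric tokens (assumed tokenizer)."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_matrix(sentences):
    """Similarity of every sentence with every other sentence."""
    vectors = [bag_of_words(sentence) for sentence in sentences]
    return [[cosine(u, v) for v in vectors] for u in vectors]


def mean_pairwise_similarity(sentences):
    """Average off-diagonal similarity; higher means more repetitive data."""
    n = len(sentences)
    if n < 2:
        return 0.0
    matrix = similarity_matrix(sentences)
    total = sum(matrix[i][j] for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))
```

A matrix of this kind is what a similarity heat map visualises: dense high-valued blocks correspond to groups of near-duplicate questions.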
Which Data Consists of Most Dissimilar Sentences?
As depicted in Figures 11(c)–11(f), most questions in QuaRel have a high similarity with other questions, indicating that the data is repetitive. The same is true for the majority of the EQUATE data. DROP also has high similarity among questions.
However, similarity among questions in our dataset is significantly lower.
Some high-similarity boxes can be seen in the chart; they are mostly due to task 2 data and partly due to task 3 data. Lower similarity implies that our dataset is far less repetitive than the others. Moreover, the repetition in our dataset is sparse and, unlike in the others, not spread evenly across the whole dataset. In this sense, our dataset is more diverse.
Note that questions in Task 2 have a lower vocabulary and, in addition, a higher similarity.
As a small set of chemistry and physics principles is used to generate the questions, the result is a fairly templated, uniform-looking dataset, leading to the observed reversal of trends in this particular task.
A.5 ExNumNet
Figure 13 illustrates our baseline model, ExNumNet. It contains a Reading Comprehension Converter module that converts the questions of each task to reading comprehension (RC) format; Figure 14 shows examples of this conversion for each task. We add a task converter module to detect the task of a question. We design the task converter heuristically, based on features associated with the questions (e.g. NLI contains "Sentence 1" and "Sentence 2" whereas completion contains a blank). For NLI questions, we use the premise sentence as the passage and the hypothesis as the question, and append the string “Entailment, contradiction or neutral?” to the question so that it has a span-based answer. For other questions, we tokenize the question string into its constituent sentences and use a heuristic approach to split it into passage and question. Furthermore, for option-based questions, we append all the options at the end of the question.
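The conversion to RC format can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the ExNumNet implementation: the exact "Sentence 1: ... Sentence 2: ..." layout, the underscore rendering of blanks, and the rule that the last sentence is the question are all assumptions, and `detect_task` and `to_rc` are hypothetical names.

```python
import re

# Assumption: NLI questions are laid out as "Sentence 1: ... Sentence 2: ...".
NLI_PATTERN = re.compile(r"Sentence 1\s*:\s*(.*?)\s*Sentence 2\s*:\s*(.*)", re.S)
NLI_SUFFIX = "Entailment, contradiction or neutral?"


def detect_task(question):
    """Heuristic task detection from surface features of the question."""
    if NLI_PATTERN.search(question):
        return "nli"
    if "___" in question:  # assumption: a blank is rendered as underscores
        return "completion"
    return "other"


def to_rc(question, options=None):
    """Convert a question string into a passage/question pair."""
    match = NLI_PATTERN.search(question)
    if match:
        # Premise becomes the passage; hypothesis plus suffix is the question.
        passage = match.group(1).strip()
        query = f"{match.group(2).strip()} {NLI_SUFFIX}"
    else:
        # Assumed split heuristic: the last sentence is the question.
        sentences = re.split(r"(?<=[.?!])\s+", question.strip())
        passage, query = " ".join(sentences[:-1]), sentences[-1]
    if options:
        # Option-based questions: append all options to the question.
        query = f"{query} {' '.join(options)}"
    return {"passage": passage, "question": query}
```

For example, `to_rc("Sentence 1: Tom has 5 apples. Sentence 2: Tom has more than 3 apples.")` yields the premise as the passage and the hypothesis followed by the fixed suffix as the question.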
A.6 Proposed Memory-Augmented Model
Figure 13 illustrates our baseline model, ExNumNet. We add an IR mechanism as described in Algorithm 1 and illustrated in Figure 3 of the main paper. As mentioned in the ‘Baselines’ subsection (Experiments section) of the main paper, we convert each task to RC format in our baseline and append the knowledge retrieved using IR from MATH KB at the end of the passage. In our experiments, we use the following hyperparameters in the IR process: , , and .
Formalization. The IR process operates on the dataset, a sample from it, and the MATH KB. It is controlled by the number of knowledge statements retrieved for each sample, the cut-off STS (Semantic Textual Similarity) value above which knowledge statements are treated as redundant and removed, and the reduction applied iteratively until the required number of statements remains.
We create a knowledge base, MATH KB, by accumulating all types of external knowledge needed to solve questions of various tasks (e.g. a human has 2 hands, a cow has 4 legs, there are 24 hours in a day, etc.). We also add the math formulae required to solve questions in our benchmark (e.g. the formula for speed in terms of distance and time). We add all of these in the form of plain text, separated by new lines. We use Elasticsearch to retrieve relevant knowledge sentences and further filter them using a heuristic relevance threshold. We append this knowledge in the beginning of the passage so that continuity is not broken between passage and question. Figure 3 of the main paper illustrates our approach.
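The retrieve-then-filter idea can be sketched as below. This is not Algorithm 1 and does not use Elasticsearch or an STS model: token-overlap (Jaccard) similarity stands in for both the retrieval score and the redundancy check, and the function names and the parameter values `top_k`, `min_relevance` and `redundancy_cutoff` are hypothetical. Following the paragraph above, the retrieved knowledge is placed at the beginning of the passage.

```python
import re


def tokens(text):
    """Lowercase alphanumeric token set (assumed tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Token-overlap similarity, a stand-in for retrieval and STS scores."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve(question, knowledge_base, top_k=3, min_relevance=0.1,
             redundancy_cutoff=0.8):
    """Keep the most relevant statements, skipping near-duplicates."""
    ranked = sorted(knowledge_base, key=lambda s: jaccard(question, s),
                    reverse=True)
    kept = []
    for statement in ranked:
        if jaccard(question, statement) < min_relevance:
            break  # everything after this point is even less relevant
        if any(jaccard(statement, k) > redundancy_cutoff for k in kept):
            continue  # redundant with a statement that is already kept
        kept.append(statement)
        if len(kept) == top_k:
            break
    return kept


def augment_passage(passage, knowledge):
    """Place the retrieved knowledge before the passage."""
    return " ".join(list(knowledge) + [passage])


# Plain-text knowledge base, one statement per line (toy content).
kb = ["A die has 6 faces", "A die has 6 faces", "A cow has 4 legs",
      "There are 24 hours in a day"]
question = "Find out the total number of faces in 10 die."
print(retrieve(question, kb))  # ['A die has 6 faces']
```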
A.7 Hyper Parameters Used
All the experiments were run with the following hyper parameters: the training batch size was kept at 16, whereas the eval batch size was 5. The experiments were run for a maximum of 5 epochs, with the warmup kept at 0.06. The learning rate was 1.5e-5 and the weight decay was 0.01.
All of the above hyper parameters were selected using a grid search; we kept the rest of the hyper parameters unaltered. All the experiments were performed on a "Tesla V100-SXM2-16GB" GPU, on which the model takes 24 hours to train on nearly 100k samples.
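For reference, the reported settings can be collected in a small configuration object, as in the sketch below. The class and field names are hypothetical. Two readings are assumptions: the warmup value of 0.06 is treated as a fraction of the total number of optimizer steps, and the learning rate is read as 1.5e-5.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    train_batch_size: int = 16
    eval_batch_size: int = 5
    max_epochs: int = 5
    warmup: float = 0.06          # assumption: fraction of total steps
    learning_rate: float = 1.5e-5  # assumption: reading of the reported value
    weight_decay: float = 0.01


def schedule_lengths(config, num_samples):
    """Return (total optimizer steps, warmup steps) for a dataset size."""
    steps_per_epoch = math.ceil(num_samples / config.train_batch_size)
    total_steps = steps_per_epoch * config.max_epochs
    return total_steps, round(config.warmup * total_steps)


config = TrainConfig()
print(schedule_lengths(config, 100_000))  # (31250, 1875)
```

Under these assumptions, training on roughly 100k samples for the full 5 epochs corresponds to 31,250 steps, of which the first 1,875 are warmup.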
A.8 Additional Examples
We provide additional examples of task 1, 2, 3 and 4 questions here to better illustrate the novel datasets we have created as part of NumGLUE.
Question  Knowledge Required  Answer 
Ella and Lily are playing a game that requires 10 die. Find out the total number of faces in 10 die.  A die has 6 faces  60 
Jacob and Lillian are running a km long race. Jacob finished the race when Lillian was 190 meters from the finish line. How many meters did Lillian cover till that time?  1000 meters make a km  810 
A man can lift one box in each of his hands. How many boxes can a group of 5 people hold in total?  A human being has 2 hands  10 
Question  Knowledge Required  Answer 
Find the mass percentage of H in C6H6  Mass of C is 12 units and mass of H is 1 unit  7.69 
How many units of H2 are required to react with 2 units of C2H4 to form 2 units of C2H6  H2 + C2H4 = C2H6  2 
A car covers 912 meters in 19 seconds. If bike’s speed is one fourth of the car. Find the distance covered by the bike in 4 seconds.  distance travelled = speed * time  48 
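The answers in the two tables above follow from combining the stated knowledge with simple arithmetic. The short check below reproduces four of them; the function names are hypothetical and exist only for this illustration.

```python
def dice_faces(num_dice, faces_per_die=6):
    """Knowledge: a die has 6 faces."""
    return num_dice * faces_per_die


def meters_covered(race_km, meters_remaining):
    """Knowledge: 1000 meters make a km."""
    return race_km * 1000 - meters_remaining


def mass_percentage_h(num_c, num_h, mass_c=12, mass_h=1):
    """Knowledge: mass of C is 12 units and mass of H is 1 unit."""
    hydrogen = num_h * mass_h
    return round(100 * hydrogen / (num_c * mass_c + hydrogen), 2)


def bike_distance(car_meters, car_seconds, speed_ratio, bike_seconds):
    """Knowledge: distance travelled = speed * time."""
    car_speed = car_meters / car_seconds
    return car_speed * speed_ratio * bike_seconds


print(dice_faces(10))                 # 60
print(meters_covered(1, 190))         # 810
print(mass_percentage_h(6, 6))        # 7.69
print(bike_distance(912, 19, 1 / 4, 4))  # 48.0
```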
QuaRel Question  Transformed Question 
A person wants to get shopping done quickly. They know that they can get through the checkout at big store faster than they can at small store. The store they go to to finish quickly is (A) big store (B) small store  A person wants to get shopping done quickly. They know that they can get through the checkout at big store in 5 minutes whereas it can take 20 minutes at small store. The store they go to to finish quickly is (A) big store (B) small store 
Tina is racing her two dogs. Her greyhound is slim, her rottweiler is heavy. The dog that gets faster more quickly is the (A) rottweiler (B) greyhound  Tina is racing her two dogs. Her greyhound weighs 88 lbs and her rottweiler weighs 79 lbs. The dog that gets faster more quickly is the (A) rottweiler (B) greyhound 
A golf ball has a smaller mass than a baseball. Which item has a weaker gravitational field? (A) golf ball (B) baseball  A golf ball has a mass of 78 grams and a baseball has a mass of 0.159 Kg. Which item has a weaker gravitational field? (A) golf ball (B) baseball 
Arithmetic Word Problem  Transformed Question 
Joan found 70 seashells on the beach. She gave Sam some of her seashells. She has 27 seashells left. How many seashells did she give to Sam ? 43  Joan found 70 seashells on the beach . She gave Sam some of her seashells . She has 27 seashells left. She gave ____ seashells to Sam. 43 
Last week Tom had 74 dollars. He washed cars over the weekend and now has 86 dollars. How much money did he make washing cars ? 12  Last week Tom had 74 dollars. He washed cars over the weekend and made another 86 dollars. Tom has ____ dollars now . 160 