1 Introduction
Numbers play a vital role in our lives. We reason with numbers in day-to-day tasks ranging from handling currency to reading news articles to understanding sports results, elections, and stock markets. Because numbers communicate information precisely, reasoning with them is an essential core competence in understanding natural language (Levinson, 2001; Frank et al., 2008; Dehaene, 2011). A benchmark task in natural language understanding is natural language inference (NLI), also known as recognizing textual entailment (RTE) (Cooper et al., 1996; Condoravdi et al., 2003; Bos and Markert, 2005; Dagan et al., 2006), wherein a model determines whether a natural language hypothesis can be justifiably inferred from a given premise. (This is often posed as a three-way decision: the hypothesis can be inferred to be true (entailment), false (contradiction), or neither (neutral).) Making such inferences often necessitates reasoning about numbers.
RTE-Quant
P: After the deal closes, Teva will generate sales of about $7 billion a year, the company said.
H: Teva earns $7 billion a year.
AWPNLI
P: Each of farmer Cunningham's 6048 lambs is either black or white and there are 193 white ones.
H: 5855 of Farmer Cunningham's lambs are black.
NewsNLI
P: With 99.6% of precincts counted, Dewhurst held 48% of the vote to 30% for Cruz.
H: Lt. Gov. David Dewhurst fails to get 50% of primary vote.
RedditNLI
P: Oxfam says richest one percent to own more than rest by 2016.
H: Richest 1% to own more than half world's wealth by 2016, Oxfam.
Consider the example:
P: With 99.6% of precincts counted, Dewhurst held 48% of the vote to 30% for Cruz.
H: Lt. Gov. David Dewhurst fails to get 50% of primary vote.
To conclude that the hypothesis is inferable, a model must reason that since 99.6% of the precincts have been counted, even if all the remaining precincts voted for Dewhurst, he would still fail to get 50% of the primary vote. Scant attention has been paid to building datasets that evaluate this reasoning ability. To address this gap, we present EQUATE (Evaluating Quantity Understanding Aptitude in Textual Entailment) (§3). EQUATE consists of five evaluation sets: Stress Test, AWPNLI, NewsNLI, RTE-Quant, and RedditNLI, each featuring different facets of quantitative reasoning in textual entailment (Table 1), e.g., range comparisons, arithmetic reasoning, and verbal reasoning involving quantities. The test sets contain synthetic data created by repurposing existing NLI and arithmetic word problem datasets, as well as natural data from news articles and social media.
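The bound behind this inference can be checked with one line of arithmetic (an illustrative sketch treating precincts as equally weighted; the variable names are ours, not from any released code):

```python
# Upper bound on Dewhurst's final vote share: 48% of the 99.6% of
# precincts already counted, plus (at best) 100% of the remaining 0.4%.
counted, remaining = 0.996, 0.004
share_in_counted = 0.48
max_final_share = share_in_counted * counted + 1.0 * remaining

# Even in the best case the share stays below 50%, so the hypothesis
# "fails to get 50% of primary vote" is entailed.
assert max_final_share < 0.50
```

This is exactly the kind of interval-style reasoning that EQUATE is designed to probe.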
We evaluate the ability of existing state-of-the-art NLI models to perform quantitative reasoning (§4.1) by benchmarking 7 published models on EQUATE. Our results show that most models are incapable of quantitative reasoning, relying instead on lexical cues for prediction. Additionally, we build QREAS, a shallow semantic reasoning baseline for quantitative reasoning in NLI (§4.2). QREAS is effective on the synthetic test sets, which contain more quantity-based inference, but shows limited success on the natural test sets, which require deeper linguistic reasoning. However, the hardest cases require a complex interplay between linguistic and numerical reasoning. The EQUATE evaluation framework makes clear where this new challenge area for textual entailment stands.
Test Set  Size  Classes  Synthetic  Data Source  Annotation Source  Quantitative Phenomena
Stress Test  7500  3  ✓  AQuA-RAT  Automatic  Quantifiers
RTE-Quant  166  2  ✗  RTE2-RTE4  Experts  Arithmetic, World knowledge, Ranges, Quantifiers
AWPNLI  722  2  ✓  Arithmetic Word Problems  Automatic  Arithmetic
NewsNLI  1000  2  ✗  CNN  Crowdworkers  Ordinality, Quantifiers, Arithmetic, World knowledge, Magnitude, Ratios
RedditNLI  250  3  ✗  Reddit  Experts  Range, Arithmetic, Approximation, Verbal
2 Related Work
NLI has attracted community-wide interest as a stringent test for natural language understanding Cooper et al. (1996); Fyodorov; Glickman et al. (2005); Haghighi et al. (2005); Harabagiu and Hickl (2006); Romano et al. (2006); Dagan et al. (2006); Giampiccolo et al. (2007); Zanzotto et al. (2006); Malakasiotis and Androutsopoulos (2007); MacCartney (2009); de Marneffe et al. (2009); Dagan et al. (2010); Angeli and Manning (2014); Marelli et al. (2014). Recently, the creation of large-scale datasets Bowman et al. (2015); Williams et al. (2017); Khot et al. (2018) spurred the development of many neural models Parikh et al. (2016); Nie and Bansal (2017); Conneau et al. (2017); Balazs et al. (2017); Chen et al. (2017a); Radford et al. (2018). However, recent work has identified biases in some datasets Gururangan et al. (2018); Poliak et al. (2018), which neural models exploit as shallow cues for prediction instead of doing the reasoning expected for the task Glockner et al. (2018); Naik et al. (2018).
Naik et al. (2018) find that model inability to do numerical reasoning causes 4% of the errors made by state-of-the-art models. Previously, de Marneffe et al. (2008) found that in a corpus of real-life contradiction pairs collected from Wikipedia and Google News, 29% of contradictions arise from numeric discrepancies, and that in many Recognizing Textual Entailment (RTE) datasets, numeric contradictions make up 8.8% of contradictory pairs. Sammons et al. (2010) and Clark (2018) present a thorough analysis of the types of reasoning required for inference, arguing for a systematic knowledge-oriented approach that evaluates specific semantic analysis tasks, and identifying quantitative reasoning in particular as an area models should concentrate on. Our work takes a first step towards addressing this by presenting an evaluation framework and a closer examination of quantitative reasoning in NLI.
While, to the best of our knowledge, prior work has not studied quantitative reasoning in NLI, Roy (2017) proposes a model for a related subtask called quantity entailment, which aims to determine whether a given quantity can be inferred from a sentence. In contrast, general-purpose textual entailment considers whether a given sentence can be inferred from another; we focus on the latter. Our work also relates to solving arithmetic word problems Hosseini et al. (2014); Mitra and Baral (2016); Zhou et al. (2015); Upadhyay et al. (2016); Huang et al. (2017); Kushman et al. (2014a); Koncel-Kedziorski et al. (2015); Roy and Roth (2016); Roy (2017); Ling et al. (2017). Word problems emphasize arithmetic reasoning, and their requirement for linguistic reasoning and world knowledge is limited, as the text is concise, straightforward, and self-contained Hosseini et al. (2014); Kushman et al. (2014b). Our work provides a testbed that evaluates basic arithmetic reasoning while also incorporating the complexity of natural language.
3 Quantitative Reasoning in NLI
Our interpretation of “quantitative reasoning” draws from cognitive testing and education Stafford (1972); Ekstrom et al. (1976), which consider it a verbal problem-solving ability. While inextricably linked to mathematics, it is an inclusive skill involving everyday language rather than a specialized lexicon. To excel at quantitative reasoning, one must interpret quantities expressed in language, perform basic calculations and judge their accuracy, and justify quantitative claims using both verbal and numeric reasoning. Based on these requirements, natural language inference lends itself as a test bed for the study of quantitative reasoning. Conversely, the ability to reason quantitatively is important for NLI Sammons et al. (2010); Clark (2018). Motivated by this interplay, we present the EQUATE (Evaluating Quantity Understanding Aptitude in Textual Entailment) framework.
3.1 The EQUATE Dataset
EQUATE consists of five NLI test sets featuring quantities. These sets (Table 2) are drawn from diverse sources and exhibit a wide range of quantitative reasoning phenomena. Some sets are controlled synthetic tests (§3.2, §3.4) designed to examine model ability to handle phenomena such as quantifiers, approximations, or arithmetic reasoning. EQUATE also includes tests featuring text from news articles and social media (§3.3, §3.5, §3.6) to examine reasoning about quantities expressed verbally in the wild. Two main restrictions are imposed during test creation. First, we remove all sentences requiring temporal reasoning, since specialized knowledge is needed to reason about time. Second, we focus on sentences containing quantity mentions with numerical values. (This is not detrimental, but it reduces the probability of observing phenomena such as vague quantification.)
3.2 Stress Test
We include the numerical reasoning stress test of Naik et al. (2018) as a sanity check. It requires models to match entities from the hypothesis to the premise and to reason with quantifiers.
3.3 RTE-Quant
This test set is constructed from the RTE sub-corpus for quantity entailment Roy (2017), originally drawn from the RTE2-RTE4 datasets Dagan et al. (2006). The original sub-corpus conflates temporal and quantitative reasoning; pairs requiring temporal reasoning are discarded, resulting in a set of 166 entailment pairs.
3.4 AWPNLI
To evaluate the arithmetic ability of NLI models, we repurpose data from arithmetic word problems Roy and Roth (2016), which have a characteristic structure: first they establish a world and optionally update its state, then a question is posed about the world. This structure forms the basis of our pair creation process (Fig. 1). World-building and update statements form the premise. A hypothesis template is generated by first identifying modal/auxiliary verbs in the question, and the subsequent verbs, which we refer to as secondary verbs. We identify the agent in the sentence and conjugate the secondary verb in present tense, followed by the identified unit, to form the final template. For every template, the correct guess is used to create an entailed hypothesis. Contradictory hypotheses are generated by randomly sampling a wrong guess (handled differently for integer and real-valued correct guesses) from a uniform distribution over an interval of 10 surrounding the correct guess (an interval of 5 for numbers less than 5).
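A minimal sketch of this contradiction-sampling step (the helper name and the exact integer/real handling are our assumptions, not the paper's released code):

```python
import random

def sample_wrong_guess(correct, rng=random):
    """Sample a contradictory value near the correct answer, following the
    interval scheme described above: width 10 around the correct guess,
    or width 5 when the correct value is below 5."""
    width = 5 if abs(correct) < 5 else 10
    while True:
        if float(correct).is_integer():
            # integer answers get integer-valued wrong guesses
            guess = correct + rng.randint(-width // 2, width // 2)
        else:
            # real-valued answers get real-valued wrong guesses
            guess = round(correct + rng.uniform(-width / 2, width / 2), 2)
        if guess != correct:
            return guess
```

The retry loop simply guarantees that the sampled value differs from the correct answer, so the resulting hypothesis is genuinely contradictory.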
We manually examine the dataset for grammaticality, finding only 2% of hypotheses to be ungrammatical; these are manually corrected, leaving a final test set of 722 pairs.
3.5 NewsNLI
This test set is created from the CNN corpus Hermann et al. (2015) of news articles with abstractive summaries. We identify summary points containing quantities, filtering out temporal expressions. For each summary point, the two most similar sentences from the article (by Jaccard similarity) are chosen, flipping pairs where the premise begins with a first-person pronoun. Only the top 50% of similar pairs are retained, to avoid lexical overlap bias. We crowdsource annotations for a subset of this data on Amazon Mechanical Turk. To ensure quality, we require that annotators have an approval rate of 95% on at least 100 prior tasks and pass a qualification test. Crowdworkers are shown two sentences and asked to determine whether the second sentence is definitely true, definitely false, or not inferable given the first. We collect 5 annotations per pair, and prefer pairs with the lowest token overlap between premise and hypothesis and the least difference in premise-hypothesis lengths, stratified by entailment label. The top 1000 samples meeting these criteria form our final test set. To validate the crowdsourced labels, experts annotate a subset of 100 pairs: crowdsourced gold labels match expert gold labels in 85% of cases, while individual crowdworker labels match expert gold labels in 75.8%.
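The sentence-selection step above can be sketched as follows (function names are ours, and the paper's implementation may differ, e.g. in tokenization):

```python
def jaccard(a, b):
    """Token-level Jaccard similarity between two sentences."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def most_similar(summary, article_sentences, k=2):
    """Pick the k article sentences most similar to a summary point,
    mirroring the premise-selection step described above."""
    return sorted(article_sentences,
                  key=lambda s: jaccard(summary, s),
                  reverse=True)[:k]
```

Because Jaccard similarity rewards token overlap, retaining only the top 50% of pairs by this measure is what guards against a dataset dominated by near-verbatim premise-hypothesis pairs.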
3.6 RedditNLI
This test set is sourced from the popular social forum Reddit. Since reasoning about quantities is important in domains like finance and economics, we scrape all headlines from posts on r/economics, keeping titles that contain quantities and do not carry meta-forum information. Titles appearing within three days of each other are clustered by Jaccard similarity, and the top 300 pairs are extracted. After filtering out nonsensical titles, such as concatenated stock prices, we are left with 250 sentence pairs. As in RTE dataset creation, two expert annotators label these pairs, achieving a Cohen's kappa of 0.82. Disagreements are discussed to resolve final labels.
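The agreement statistic is the standard Cohen's kappa over the two annotators' label sequences; a plain-Python sketch of the formula (the paper reports kappa = 0.82 for RedditNLI):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    # chance agreement from each annotator's marginal label distribution
    expected = sum(ca[label] * cb[label] for label in ca) / (n * n)
    return (observed - expected) / (1 - expected)
```

Values above 0.8 are conventionally read as near-perfect agreement, which is why the 0.82 figure supports using expert labels directly after adjudicating disagreements.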
4 Models
We describe the 9 NLI models used in this study, and our new baseline. (Performance statistics of all implementations on the MultiNLI dataset are provided in the supplementary material; our results closely match the original publications in all cases.) The interested reader is invited to refer to the corresponding publications for further details.
INPUT  
Set of “compatible” singlevalued premise quantities  
Set of “compatible” rangevalued premise quantities  
Hypothesis quantity  
Operator set  
Length of equation to be generated  
Symbol list ()  
Type list (set of types from )  
Length of symbol list  
Index of first range quantity in symbol list  
Index of first operator in symbol list  
OUTPUT  
Index of symbol assigned to position in postfix equation  
VARIABLES  
Main ILP variable for position  
Indicator variable: is a single value?  
Indicator variable: is a range?  
Indicator variable: is an operator?  
Stack depth of  
Type index for 
Definitional Constraints  
Range restriction  or for if 
and for if  
for if  
Uniqueness  for 
Stack definition  (Stack depth initialization) 
for (Stack depth update)  
Syntactic Constraints  
First two operands  and 
Last operator  (Last operator should be one of ) 
Last operand  (Last operand should be hypothesis quantity) 
Other operators  for if 
Other operands  for if 
for if  
Empty stack  (Nonempty stack indicates invalid postfix expression) 
Premise usage  for if 
Operand Access  
Right operand  for such that 
Left operand  for where and is the largest index such that and 
Type Consistency Constraints  

Type assignment  for if and 
Two type match  for such that 
One type match  for such that 
for such that  
Operator Consistency Constraints  
Arithmetic operators  for such that 
Range operators  for such that 
for such that 
4.1 NLI Models

Majority Class (MAJ): A simple baseline that always predicts the majority class in the test set.
Hypothesis-Only (HYP): A FastText classifier trained only on hypotheses to predict the entailment relation Gururangan et al. (2018).
ALIGN: A bag-of-words alignment model inspired by MacCartney (2009).
CBOW: A simple bag-of-embeddings sentence representation model Nangia et al. (2017).
BiLSTM: The simple BiLSTM model described by Nangia et al. (2017).
Chen (CH): Stacked BiLSTM-RNNs with shortcut connections and character word embeddings Chen et al. (2017b).
InferSent (IS): A single-layer BiLSTM-RNN model with max-pooling. Conneau et al. (2017) show that this architecture learns robust universal sentence representations which transfer well across several inference tasks.
RSE: Stacked BiLSTM-RNNs with shortcut connections Nie and Bansal (2017).
ESIM: The sequential inference model proposed by Chen et al. (2017a), which uses BiLSTMs with an attention mechanism.
4.2 QREAS Baseline
Figure 2 presents an overview of the QREAS baseline for quantitative reasoning in NLI. The model manipulates quantity representations symbolically to decide the entailment relation, and is weakly supervised, using only the final entailment label as supervision. The baseline has four stages: quantity mentions are extracted and parsed into semantic representations called NumSets (§4.2.1, §4.2.2); compatible NumSets are extracted (§4.2.3) and composed (§4.2.4) to form justifications; and justifications are analyzed to determine entailment labels (§4.2.5).
4.2.1 Quantity Segmenter
Inspired by Barwise and Cooper (1981), we consider a quantity as having a number, a unit, and an optional approximator. We extract quantity mentions by identifying all least-common-ancestor noun phrases containing cardinal numbers in the constituency parse of the sentence.
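As an illustration of the segmentation idea, here is a dependency-free approximation over POS-tagged tokens (the paper's implementation works on constituency parses; this sketch, including the function name, is our simplification):

```python
def quantity_mentions(tagged):
    """Given (token, POS) pairs, return minimal noun-phrase-like spans
    around each cardinal number (CD): the CD plus adjacent numeric,
    adjectival, and nominal tokens."""
    mentions, i = [], 0
    while i < len(tagged):
        if tagged[i][1] == "CD":
            j = i + 1
            # absorb adjacent cardinal numbers, modifiers, and head nouns
            while j < len(tagged) and tagged[j][1] in {"CD", "JJ", "NN", "NNS", "NNP"}:
                j += 1
            mentions.append(" ".join(tok for tok, _ in tagged[i:j]))
            i = j
        else:
            i += 1
    return mentions
```

A constituency-based segmenter instead takes the lowest NP dominating each CD, which also captures determiners and modifiers to the left of the number; the sketch above only looks rightward.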
4.2.2 Quantity Parser
Our quantity parser constructs a grounded representation, henceforth a NumSet, for each quantity mention in the premise and hypothesis. A NumSet can also be a composition of other NumSets. A NumSet consists of (val, unit) tuples with:
1. val: the quantity value, represented as a range
2. unit: the unit noun associated with the quantity
To extract values for a quantity, we extract cardinal numbers, recording contiguity, and normalize them (removing commas, converting written numbers to floats, and deciding the numerical value; for example, “hundred fifty eight thousand” is 158000, “two fifty eight” is 258, “374m” is 3740000, etc.). If cardinal numbers are non-adjacent, we look for an explicitly mentioned range marker such as “to” or “between”. We also handle simple ratios such as “quarter” and “half”, and extract bounds (e.g., “less than 10 apples” is parsed to a range with upper bound 10 and unit “apples”).
To extract units, we examine tokens adjacent to cardinal numbers in the quantity mention and identify known units. If no known units are found, we assign the token in a numerical-modifier relationship with the cardinal number; else we assign the noun nearest to the cardinal number as the unit. A quantity is determined to be approximate if the word in an adverbial-modifier relation with the cardinal number appears in a gazetteer of approximators (“roughly”, “approximately”, “about”, “nearly”, “roundabout”, “around”, “circa”, “almost”, “approaching”, “pushing”, “more or less”, “in the neighborhood of”, “in the region of”, “on the order of”, “something like”, “give or take (a few)”, “near to”, “close to”, “in the ballpark of”). If approximate, the range is extended to ±2% of the current value.
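A rough sketch of the normalization and approximation steps (the suffix table and both function names are illustrative assumptions, not the paper's exact rule set):

```python
def normalize_number(text):
    """Normalize a numeric surface form to a float: strip commas and
    expand common scale suffixes."""
    text = text.replace(",", "").lower().strip()
    scales = {"k": 1e3, "m": 1e6, "bn": 1e9, "billion": 1e9, "million": 1e6}
    for suffix, mult in scales.items():
        if text.endswith(suffix) and text[: -len(suffix)].strip():
            try:
                return float(text[: -len(suffix)]) * mult
            except ValueError:
                pass  # e.g. "random" ends in "m" but has no numeric prefix
    return float(text)

def approximate_range(value, tol=0.02):
    """Widen an 'approximate' quantity to a +/-2% range, as described above."""
    return (value * (1 - tol), value * (1 + tol))
```

Representing even exact quantities as (possibly degenerate) ranges keeps the downstream composition and comparison logic uniform.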
4.2.3 Quantity Pruner
The pruner constructs “compatible” premise-hypothesis NumSet pairs. Consider the pair “Insurgents killed 7 U.S. soldiers, set off a car bomb that killed four Iraqi policemen” and “7 US soldiers were killed, and at least 10 Iraqis died”. Our parser extracts NumSets corresponding to “four Iraqi policemen” and “7 US soldiers” from the premise and hypothesis respectively, but these NumSets should not be compared, as they involve different units. The pruner discards such incompatible pairs. Heuristics to detect unit-compatible NumSet pairs include direct string match, and synonymy and hypernymy relations from WordNet. Like Roy (2017), we also consider two units compatible if one is a nationality or a job (lists scraped from Wikipedia) and the other is synonymous with “people”, “person”, “citizen”, or “worker”.
4.2.4 Quantity Composition
The composition module detects whether a hypothesis NumSet is justified by composing “compatible” premise NumSets. Our framework generates postfix arithmetic equations over premise NumSets that justify the hypothesis NumSet (direct comparisons are incorporated by adding “=” as an operator). Note that the set of possible equations is exponential in the number of NumSets, making exhaustive generation intractable, and a large number of equations are invalid because they violate constraints such as unit consistency. Our framework therefore uses integer linear programming (ILP) to constrain the equation space. It is inspired by prior work on algebra word problems Koncel-Kedziorski et al. (2015), with some key differences:
1. Arithmetic equations: We focus on arithmetic rather than algebraic equations for NLI.
2. Range arithmetic: Quantitative reasoning involves ranges, which are handled by representing them as endpoint-inclusive intervals and adding four range operators.
3. Hypothesis quantity-driven: We optimize an ILP model for each hypothesis NumSet, because a sentence pair is marked “entailment” iff every hypothesis quantity is justified.
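As an illustration of range arithmetic over endpoint-inclusive intervals (the paper's four range operators are not enumerated in this excerpt, so these particular helpers are our assumptions):

```python
def range_add(a, b):
    """Endpoint-inclusive interval addition:
    [a1, a2] + [b1, b2] = [a1 + b1, a2 + b2]."""
    return (a[0] + b[0], a[1] + b[1])

def range_sub(a, b):
    """[a1, a2] - [b1, b2] = [a1 - b2, a2 - b1]."""
    return (a[0] - b[1], a[1] - b[0])

def range_contains(a, b):
    """Whether interval a contains interval b; useful for checking that a
    hypothesis range is justified by a composed premise range."""
    return a[0] <= b[0] and b[1] <= a[1]
```

Treating single values as degenerate intervals ([v, v]) lets the same operator set cover both cases, which is why the ILP formulation distinguishes single-valued and range-valued operands by indicator variables rather than by separate machinery.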
Table 3 describes ILP variables. We impose the following types of constraints:
1. Definitional Constraints: Ensure that ILP variables take on valid values by constraining initialization, range and update.
2. Syntactic Constraints: Assure syntactic validity of generated postfix expressions by limiting operatoroperand ordering.
3. Operand Access: Simulate stackbased evaluation correctly by choosing correct operatoroperand assignments.
4. Type Consistency: Ensure that all operations are typecompatible.
5. Operator Consistency: Force range operators to have range operands and mathematical operators to have singlevalued operands.
Definitional, syntactic, and operand access constraints ensure mathematical validity, while type and operator consistency constraints add linguistic consistency. Constraint formulations are provided in Tables 4 and 5. We limit tree depth to 3 and retrieve up to 50 solutions per hypothesis NumSet, then solve each to determine whether the equation is mathematically correct. We discard equations that use invalid operations (division by zero) or add unnecessary complexity (multiplication/division by 1). The remaining equation trees are considered plausible justifications.
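For intuition, here is a naive brute-force stand-in for the constrained equation search, restricted to pairs of premise quantities (a shallow version of the depth-3 search; the ILP machinery replaces exactly this kind of exhaustive enumeration, and the function name is ours):

```python
from itertools import permutations

def justifies(premise_vals, hypothesis_val, tol=1e-9):
    """Can the hypothesis quantity be produced from premise quantities,
    either directly or via one arithmetic operation?"""
    # direct comparison (the "=" operator)
    if any(abs(v - hypothesis_val) < tol for v in premise_vals):
        return True
    ops = [lambda x, y: x + y,
           lambda x, y: x - y,
           lambda x, y: x * y,
           lambda x, y: x / y if y else None]  # guard division by zero
    for a, b in permutations(premise_vals, 2):
        for op in ops:
            result = op(a, b)
            if result is not None and abs(result - hypothesis_val) < tol:
                return True
    return False
```

On the AWPNLI example above, 6048 − 193 = 5855 is found immediately; the ILP's added value is pruning unit- and type-inconsistent equations before any arithmetic is attempted.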
4.2.5 Global Reasoner
The global reasoner predicts the final entailment label, on the assumption that every NumSet in the hypothesis must be justified for entailment, as a necessary but not sufficient condition. If any NumSet in the hypothesis has no justification, the predicted label is neutral, whereas if any NumSet is contradicted by the premise, the prediction is contradiction. Relying on this intuition, the global reasoner collects justifications from the pruner and composition modules and decides the final entailment label using the procedure described in Algorithm 1. (MaxSimilarityClass takes two quantities and returns a probability distribution over entailment labels based on unit match; similarly, ValueMatch detects whether two quantities match in value, and can also handle ranges.)
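The decision rule can be sketched as follows (a simplification of Algorithm 1, whose full pseudocode is not reproduced in this excerpt; the set-based inputs are our assumption):

```python
def global_label(hypothesis_numsets, justified, contradicted):
    """Necessary-condition decision rule: every hypothesis NumSet must be
    justified for entailment; a contradicted NumSet forces contradiction;
    an unjustified, uncontradicted NumSet yields neutral."""
    if any(q in contradicted for q in hypothesis_numsets):
        return "contradiction"
    if all(q in justified for q in hypothesis_numsets):
        return "entailment"
    return "neutral"
```

Note the asymmetry: a single contradicted quantity is decisive, while entailment demands that all quantities be covered.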
5 Results and Discussion
Model  Stress Test  RTE-Quant  AWPNLI  NewsNLI  RedditNLI  Avg.
(Each cell: accuracy, followed by the difference from the majority-class (MAJ) baseline.)
MAJ  33.3  0.0  57.8  0.0  50.0  0.0  50.0  0.0  58.4  0.0  49.9
HYP  31.2  -2.1  49.4  -0.6  50.1  +0.1  52.0  +2.0  40.8  -17.6  44.7
ALIGN  22.6  -10.7  62.1  +4.3  47.2  -2.8  54.2  +4.2  34.8  -23.6  44.2
CBOW  30.2  -3.1  47.0  -10.8  50.7  +0.7  59.7  +9.7  42.4  -16.0  46.0
BiLSTM  31.2  -2.1  51.2  -6.6  51.3  +1.2  64.3  +14.3  50.8  -7.6  49.8
CH  30.3  -3.0  54.2  -3.6  50.3  +0.3  64.7  +14.7  55.2  -3.2  51.0
IS  28.8  -4.5  66.3  +8.5  50.7  +0.7  67.2  +17.2  29.6  -28.8  48.5
NB  28.4  -4.9  58.4  +0.6  50.7  +0.7  68.0  +18.0  49.2  -9.2  51.0
ESIM  21.8  -11.5  54.8  -3.0  50.1  +0.1  63.6  +13.6  45.6  -12.8  47.2
OpenAI  36.4  +3.1  68.1  +10.3  50.0  +0.0  74.9  +24.9  52.4  -6.0  56.4
QREAS  61.8  +28.0  59.0  +2.3  77.9  +27.9  62.3  +12.3  50.4  -8.0  62.3
Dataset  Segmenter  Parser  Pruner  Composition  Reasoner
Stress Test  5  70  0  18  7
RTE-Quant  20  38  3  8  0
AWPNLI  5  28  26  31  10
NewsNLI  21  42  21  0  6
RedditNLI  37  40  2  6  0
Table 6 presents results on EQUATE. Neural models, particularly the OpenAI GPT model Radford et al. (2018), excel at verbal aspects of quantitative reasoning (RTE-Quant, NewsNLI), whereas QREAS excels at numerical aspects (Stress Test, AWPNLI).

Neural Models on NewsNLI: To tease apart the contributory effects of numerical and verbal reasoning in natural data, we experiment with NewsNLI. We extract all entailed pairs where a quantity appears in both premise and hypothesis, and perturb the quantity in the hypothesis to generate contradictory pairs. For example, assuming scalar implicature and event coreference, the entailed pair (“In addition to 79 fatalities, some 170 passengers were injured.”, “The crash took the lives of 79 people and injured some 170”) is changed to the contradictory pair (“In addition to 79 fatalities, some 170 passengers were injured.”, “The crash took the lives of 77 people and injured some 170”). Our perturbed test set contains 261 pairs. On this set, OpenAI GPT (the best-performing neural model on NewsNLI) achieves an accuracy of 32.33%, compared to 71.26% on the unperturbed set, suggesting that the model relies on verbal rather than numerical reasoning. In comparison, QREAS achieves an accuracy of 67.2% on the perturbed set, compared to 55.93% on the unperturbed set, highlighting its reliance on quantities rather than verbal information. Closer examination reveals that OpenAI GPT switches to predicting the ‘neutral’ category for perturbed samples instead of entailment, accounting for 51.7% of its errors, possibly symptomatic of lexical bias issues Naik et al. (2018).
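The perturbation probe can be sketched with a simple regex (the helper name and the offset of 2 are illustrative choices, mirroring the '79 people' to '77 people' example above):

```python
import re

def perturb_quantity(hypothesis, offset=2):
    """Create a contradictory hypothesis by shifting the first number
    in an entailed hypothesis; returns None if no number is present."""
    match = re.search(r"\d+", hypothesis)
    if not match:
        return None
    value = int(match.group())
    return (hypothesis[: match.start()]
            + str(value - offset)
            + hypothesis[match.end():])
```

Because only the number changes, any accuracy drop on the perturbed pairs isolates numerical (rather than verbal) reasoning failures.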

What Quantitative Phenomena Are Hard? We sample 100 errors made by QREAS on each test set in EQUATE (Table 14) to identify phenomena not addressed by simple quantity comparison. On natural datasets, whose sentences have complex linguistic structure, the segmenter and parser cause most errors (66% on average), indicating that identifying quantities, or parsing them into a representation, is more difficult on these datasets. Conversely, the composition module has a higher error rate on synthetic data (24.5%) than natural data (4.7%). Our analysis of error causes suggests avenues for future research:

Incorporating real-world knowledge: Lack of real-world knowledge causes errors in identifying quantities and valid comparisons. Errors include the inability to map abbreviations to correct units (e.g., “m” to “meters”), to detect part-whole coreference (e.g., “seats” can be used to refer to “buses”), or to correctly resolve hypernymy/hyponymy (e.g., “young men” to “boys”).

Inferring underspecified quantities: Quantity attributes can be implicitly specified, requiring inference to generate a complete representation. Consider “A mortar attack killed four people and injured 80”: a system must infer that the quantity “80” refers to people. On RTE-Quant, 20% of such cases stem from zero anaphora, a hard problem even in coreference resolution.

Arithmetic comparison limitations: Some examples require composition between unit-incompatible quantities. For example, to correctly label the pair “There were 3 birds and 6 nests”, “There were 3 more nests than birds”, the quantities “3 birds” and “6 nests” must be composed.

Integrating verbal reasoning: No model integrates complex verbal and quantitative reasoning. For example, consider the pair “Two people were injured in the attack”, “Two people perpetrated the attack”. The quantities “two people” and “two people” are unit-compatible, but must not be compared. Numbers and language are intricately interleaved, and developing a reasoner capable of handling this complex interplay remains challenging.

6 Conclusion
In this work, we present EQUATE, an evaluation framework to estimate the ability of models to reason quantitatively in textual entailment. We observe that existing neural approaches rely on the verbal reasoning aspect of the task to succeed rather than reasoning about quantities. We also present
QREAS, a baseline that symbolically reasons about quantities. While it achieves some success at numerical reasoning, it lacks sophisticated verbal reasoning capabilities, indicating the complexity of the inference task. We believe a promising avenue is to combine the strengths of neural models and specialized reasoners in hybrid architectures, though it remains unclear how best to achieve this. In the future, we hope our insights, and the EQUATE evaluation framework, lead to the development of models that can more precisely reason about quantities in natural language.
Acknowledgments
This work has partially been supported by the National Science Foundation under Grant No. CNS 1330596. The data collection process has been funded by the Fellowship in Digital Health from the Center of Machine Learning and Health at Carnegie Mellon University. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the NSF, or the US Government. The authors would like to thank Graham Neubig for helpful discussion regarding this work, and Shruti Rijhwani and Siddharth Dalmia for reviews while drafting this paper.
References

 Angeli and Manning (2014) Gabor Angeli and Christopher D. Manning. 2014. NaturalLI: Natural logic inference for common sense reasoning. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 534–545.
 Balazs et al. (2017) Jorge Balazs, Edison Marrese-Taylor, Pablo Loyola, and Yutaka Matsuo. 2017. Refining raw sentence representations for textual entailment recognition via attention. In Proceedings of the 2nd Workshop on Evaluating Vector Space Representations for NLP, pages 51–55, Copenhagen, Denmark. Association for Computational Linguistics.
 Barwise and Cooper (1981) Jon Barwise and Robin Cooper. 1981. Generalized quantifiers and natural language. In Philosophy, Language, and Artificial Intelligence, pages 241–301. Springer.
 Bos and Markert (2005) Johan Bos and Katja Markert. 2005. Recognising textual entailment with logical inference. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 628–635. Association for Computational Linguistics.
 Bowman et al. (2015) Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics.
 Chen et al. (2017a) Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017a. Enhanced LSTM for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1657–1668, Vancouver, Canada. Association for Computational Linguistics.
 Chen et al. (2017b) Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017b. Recurrent neural network-based sentence encoder with gated attention for natural language inference. In Proceedings of the 2nd Workshop on Evaluating Vector Space Representations for NLP, pages 36–40, Copenhagen, Denmark. Association for Computational Linguistics.
 Clark (2018) Peter Clark. 2018. What knowledge is needed to solve the rte5 textual entailment challenge? arXiv preprint arXiv:1806.03561.
 Condoravdi et al. (2003) Cleo Condoravdi, Dick Crouch, Valeria De Paiva, Reinhard Stolle, and Daniel G. Bobrow. 2003. Entailment, intensionality and text understanding. In Proceedings of the HLT-NAACL 2003 Workshop on Text Meaning - Volume 9, pages 38–45. Association for Computational Linguistics.
 Conneau et al. (2017) Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680, Copenhagen, Denmark. Association for Computational Linguistics.
 Cooper et al. (1996) Robin Cooper, Dick Crouch, Jan Van Eijck, Chris Fox, Johan Van Genabith, Jan Jaspars, Hans Kamp, David Milward, Manfred Pinkal, Massimo Poesio, et al. 1996. Using the framework. Technical report.
 Dagan et al. (2010) Ido Dagan, Bill Dolan, Bernardo Magnini, and Dan Roth. 2010. The fourth pascal recognizing textual entailment challenge. Journal of Natural Language Engineering.
 Dagan et al. (2006) Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment, pages 177–190. Springer.
 Dehaene (2011) Stanislas Dehaene. 2011. The number sense: How the mind creates mathematics. OUP USA.
 Ekstrom et al. (1976) Ruth B. Ekstrom, Diran Dermen, and Harry Horace Harman. 1976. Manual for Kit of Factor-Referenced Cognitive Tests, volume 102. Educational Testing Service, Princeton, NJ.
 Frank et al. (2008) Michael C Frank, Daniel L Everett, Evelina Fedorenko, and Edward Gibson. 2008. Number as a cognitive technology: Evidence from pirahã language and cognition. Cognition, 108(3):819–824.
 Fyodorov (n.d.) Yaroslav Fyodorov. A natural logic inference system. CiteSeer.
 Giampiccolo et al. (2007) Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2007. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, pages 1–9. Association for Computational Linguistics.
 Glickman et al. (2005) Oren Glickman, Ido Dagan, and Moshe Koppel. 2005. Web based probabilistic textual entailment.
 Glockner et al. (2018) Max Glockner, Vered Shwartz, and Yoav Goldberg. 2018. Breaking NLI systems with sentences that require simple lexical inferences. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 650–655, Melbourne, Australia. Association for Computational Linguistics.
 Gururangan et al. (2018) Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R Bowman, and Noah A Smith. 2018. Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324.
 Haghighi et al. (2005) Aria Haghighi, Andrew Ng, and Christopher Manning. 2005. Robust textual inference via graph matching. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing.
 Harabagiu and Hickl (2006) Sanda Harabagiu and Andrew Hickl. 2006. Methods for using textual entailment in open-domain question answering. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 905–912. Association for Computational Linguistics.
 Hermann et al. (2015) Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1693–1701.
 Hosseini et al. (2014) Mohammad Javad Hosseini, Hannaneh Hajishirzi, Oren Etzioni, and Nate Kushman. 2014. Learning to solve arithmetic word problems with verb categorization. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 523–533.
 Huang et al. (2017) Danqing Huang, Shuming Shi, Chin-Yew Lin, and Jian Yin. 2017. Learning fine-grained expressions to solve math word problems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 805–814.
 Khot et al. (2018) Tushar Khot, Ashish Sabharwal, and Peter Clark. 2018. SciTail: A textual entailment dataset from science question answering.
 Koncel-Kedziorski et al. (2015) Rik Koncel-Kedziorski, Hannaneh Hajishirzi, Ashish Sabharwal, Oren Etzioni, and Siena Dumas Ang. 2015. Parsing algebraic word problems into equations. Transactions of the Association for Computational Linguistics, 3:585–597.
 Kushman et al. (2014a) Nate Kushman, Yoav Artzi, Luke Zettlemoyer, and Regina Barzilay. 2014a. Learning to automatically solve algebra word problems. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 271–281.
 Kushman et al. (2014b) Nate Kushman, Luke S. Zettlemoyer, Regina Barzilay, and Yoav Artzi. 2014b. Learning to automatically solve algebra word problems. In ACL.
 Levinson (2001) Stephen C Levinson. 2001. Pragmatics. In International Encyclopedia of Social and Behavioral Sciences: Vol. 17, pages 11948–11954. Pergamon.
 Ling et al. (2017) Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. 2017. Program induction by rationale generation: Learning to solve and explain algebraic word problems. arXiv preprint arXiv:1705.04146.
 MacCartney (2009) Bill MacCartney. 2009. Natural language inference. Stanford University.
 Malakasiotis and Androutsopoulos (2007) Prodromos Malakasiotis and Ion Androutsopoulos. 2007. Learning textual entailment using SVMs and string similarity measures. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, pages 42–47. Association for Computational Linguistics.
 Marelli et al. (2014) Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, Roberto Zamparelli, et al. 2014. A SICK cure for the evaluation of compositional distributional semantic models.
 de Marneffe et al. (2009) Marie-Catherine de Marneffe, Sebastian Padó, and Christopher D Manning. 2009. Multi-word expressions in textual inference: Much ado about nothing? In Proceedings of the 2009 Workshop on Applied Textual Inference, pages 1–9. Association for Computational Linguistics.
 de Marneffe et al. (2008) Marie-Catherine de Marneffe, Anna N Rafferty, and Christopher D Manning. 2008. Finding contradictions in text. In Proceedings of ACL-08: HLT, pages 1039–1047.
 Mitra and Baral (2016) Arindam Mitra and Chitta Baral. 2016. Learning to use formulas to solve simple arithmetic problems. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 2144–2153.
 Naik et al. (2018) Aakanksha Naik, Abhilasha Ravichander, Norman Sadeh, Carolyn Rosé, and Graham Neubig. 2018. Stress test evaluation for natural language inference. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2340–2353, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
 Nangia et al. (2017) Nikita Nangia, Adina Williams, Angeliki Lazaridou, and Samuel Bowman. 2017. The RepEval 2017 shared task: Multi-genre natural language inference with sentence representations. In Proceedings of the 2nd Workshop on Evaluating Vector Space Representations for NLP, pages 1–10, Copenhagen, Denmark. Association for Computational Linguistics.
 Nie and Bansal (2017) Yixin Nie and Mohit Bansal. 2017. Shortcut-stacked sentence encoders for multi-domain inference. In Proceedings of the 2nd Workshop on Evaluating Vector Space Representations for NLP, pages 41–45, Copenhagen, Denmark. Association for Computational Linguistics.
 Parikh et al. (2016) Ankur P Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. arXiv preprint arXiv:1606.01933.
 Poliak et al. (2018) Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. 2018. Hypothesis only baselines in natural language inference. arXiv preprint arXiv:1805.01042.
 Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pretraining.
 Romano et al. (2006) Lorenza Romano, Milen Kouylekov, Idan Szpektor, Ido Dagan, and Alberto Lavelli. 2006. Investigating a generic paraphrase-based approach for relation extraction.
 Roy (2017) Subhro Roy. 2017. Reasoning about quantities in natural language. Ph.D. thesis, University of Illinois at Urbana-Champaign.
 Roy and Roth (2016) Subhro Roy and Dan Roth. 2016. Solving general arithmetic word problems. arXiv preprint arXiv:1608.01413.
 Sammons et al. (2010) Mark Sammons, VG Vydiswaran, and Dan Roth. 2010. Ask not what textual entailment can do for you… In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1199–1208. Association for Computational Linguistics.
 Stafford (1972) Richard E Stafford. 1972. Hereditary and environmental components of quantitative reasoning. Review of Educational Research, 42(2):183–201.
 Upadhyay et al. (2016) Shyam Upadhyay, Ming-Wei Chang, Kai-Wei Chang, and Wen-tau Yih. 2016. Learning from explicit and implicit supervision jointly for algebra word problems. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 297–306.
 Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762.
 Williams et al. (2017) Adina Williams, Nikita Nangia, and Samuel R Bowman. 2017. A broadcoverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426.
 Zanzotto et al. (2006) F Zanzotto, Alessandro Moschitti, Marco Pennacchiotti, and M Pazienza. 2006. Learning textual entailment from examples. In Second PASCAL recognizing textual entailment challenge, page 50. PASCAL.
 Zhou et al. (2015) Lipu Zhou, Shuaixiang Dai, and Liwei Chen. 2015. Learn to solve algebra word problems using quadratic programming. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 817–822.
7 Supplemental Material
Baseline performances on the MultiNLI Dev (Matched) set. All reimplementations closely match the performance reported in the original publications.
Model               MultiNLI Dev
Hyp Only            53.18%
ALIGN               45.0%
CBOW                63.5%
BiLSTM              70.2%
Chen                73.7%
NB                  74.2%
InferSent           70.3%
ESIM                76.2%
OpenAI Transformer  81.35%