Turning Tables: Generating Examples from Semi-structured Tables for Endowing Language Models with Reasoning Skills

07/15/2021 · Ori Yoran, et al. · Tel Aviv University

Models pre-trained with a language modeling objective possess ample world knowledge and language skills, but are known to struggle in tasks that require reasoning. In this work, we propose to leverage semi-structured tables, and automatically generate at scale question-paragraph pairs, where answering the question requires reasoning over multiple facts in the paragraph. We add a pre-training step over this synthetic data, which includes examples that require 16 different reasoning skills such as number comparison, conjunction, and fact composition. To improve data efficiency, we propose sampling strategies that focus training on reasoning skills the model is currently lacking. We evaluate our approach on three reading comprehension datasets that are focused on reasoning, and show that our model, PReasM, substantially outperforms T5, a popular pre-trained encoder-decoder model. Moreover, sampling examples based on current model errors leads to faster training and higher overall performance.


1 Introduction

Figure 1: An example table and question-context-answer triplets generated from the table as synthetic data. Each color corresponds to a different reasoning skill and colored cells are necessary to answer the question. The contexts shown are partial, such that the actual context contains the necessary information to answer the question and additional distractors. Answers are not necessarily extractive (e.g., date difference).

Large pre-trained language models (LMs) Devlin et al. (2019); Liu et al. (2019); Brown et al. (2020); Raffel et al. (2020) have become the backbone of natural language processing in recent years. However, recent work has shown that they struggle to perform symbolic reasoning operations, such as composition or conjunction of facts Talmor et al. (2019, 2020), numerical operations Wallace et al. (2019); Hidey et al. (2020), and quantification Warstadt et al. (2019), without substantial amounts of additional data.

Past work on improving reasoning skills in pre-trained models has taken two flavors: (a) adding specialized components for specific skills, like numerical and temporal reasoning Ran et al. (2019); Gupta et al. (2020a); Khot et al. (2021); Chen et al. (2020a), or (b) generating synthetic examples at scale, for example, by using grammars and templates Rozen et al. (2019); Zhao et al. (2019); Andreas (2020); Asai and Hajishirzi (2020); Campagna et al. (2020); Geva et al. (2020), and question generation models Alberti et al. (2019); Puri et al. (2020); Bartolo et al. (2021).

Figure 2: Approach overview. First, we use semi-structured tables to generate large amounts of data from 16 different example generators (EGs), each corresponding to a different reasoning skill. Then, a pre-trained LM is trained over this data in a multi-task setup to obtain our model, PReasM, where we dynamically sample examples based on current model errors (arrow width corresponds to the number of sampled examples). Last, our model is fine-tuned and evaluated on target tasks that require reasoning.

In this work, we take the latter approach and argue that semi-structured tables are a valuable resource for automatic generation of training data that will endow LMs with reasoning skills. Tables can be crawled from the web at scale, and cover a wide range of domains and topics. Moreover, their structured nature makes them amenable to automatic processes of data generation. Specifically, given a table, we use templates to generate reading comprehension (RC) examples, that is, question-context-answer triplets, where answering the question requires diverse types of reasoning over facts mentioned in the context. Fig. 1 shows an example table, and three generated question-context-answer examples, which require fact composition, number comparison, and computing a date difference. Unlike prior work where semi-structured data was used for reasoning over tables or knowledge-bases Eisenschlos et al. (2020); Yin et al. (2020); Herzig et al. (2020); Yu et al. (2021), here we harness tables to allow LMs to reason over text directly.

Fig. 2 provides an overview of our approach. We generate data by crawling tables from Wikipedia, and applying 16 different example generators (EGs) on each table. Each EG corresponds to a particular reasoning skill (composition, numerical comparison, etc.; see Table 1 for the full list), and comprises a small set of question templates. Variables in the templates are filled with content from the table, and the structure of the table allows us to compute the answer automatically. The context is a list of facts generated from the table, containing the facts required for answering the question as well as distractor facts.
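To make the template-filling step concrete, the sketch below instantiates the 2/3-hop composition question template from Table 1 on a toy table and computes the answer directly from the table structure (the chained context facts are a separate step, illustrated in §2.1). This is an illustrative sketch, not the released generation code; the table representation and helper names are our own assumptions.

```python
# Illustrative sketch (not the released generation code) of instantiating the
# 2/3-hop composition question template from Table 1 on a toy table.
# The table representation and helper names are assumptions.
import random

def generate_composition_question(table, page_title, table_title, seed=0):
    """table: {'columns': [...], 'rows': [[...], ...]} with string cells."""
    rng = random.Random(seed)
    col1, col2 = rng.sample(table["columns"], 2)
    i1, i2 = table["columns"].index(col1), table["columns"].index(col2)
    val2 = rng.choice(table["rows"])[i2]
    question = (f"What was the {col1}(s) when the {col2} was {val2} "
                f"in {table_title} of {page_title}?")
    # The table structure lets us compute the answer programmatically:
    # collect col1 values from all rows whose col2 cell equals val2.
    answer = [r[i1] for r in table["rows"] if r[i2] == val2]
    return question, answer

toy = {"columns": ["Round", "Attendance"],
       "rows": [["QF", "34,178"], ["QFR", "33,861"], ["R4", "9,789"]]}
print(generate_composition_question(toy, "1990-91 Chelsea F.C. season", "League Cup"))
```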

We add a pre-training step over this generated data, where we perform multi-task training over the 16 tasks corresponding to the EGs. Since each EG can generate vast numbers of examples, it is important to focus training on reasoning skills that the model lacks. Thus, we use error-driven sampling to construct training batches, where most examples are sampled from EGs that the model currently struggles with. We experiment with error sampling Gottumukkala et al. (2020), where examples are sampled in proportion to the current error rate on a particular task, and propose momentum sampling, where examples are sampled in proportion to how fast the model is improving on a task. We show that when some tasks are very noisy, momentum sampling is more data efficient than error sampling.

We fine-tune our Pre-trained for Reasoning Model, PReasM, on three RC datasets that require reasoning: DROP Dua et al. (2019), IIRC Ferguson et al. (2020), and MMQA Talmor et al. (2021). PReasM outperforms the original pre-trained T5 Raffel et al. (2020) model by significant margins on all three datasets. Our results set a new state-of-the-art on MMQA and are the best results on IIRC for models where the retriever and reader are trained separately. Our analysis shows that PReasM leads to dramatic improvements on specific question types, such as computing the difference between two dates, without causing a drop on other question types.

In conclusion, our results suggest that semi-structured tables are a viable and untapped source of information for automatically generating large amounts of data that can be used to endow LMs with reasoning skills that are not captured using current pre-training approaches.

Our code, data, and models are publicly available and can be downloaded from https://github.com/oriyor/turning_tables.

2 Data Generation

We succinctly define the problem setup, and then turn to the process of automatic data generation from tables.

Problem Setup

Our goal is to train an RC model that, given a question q and textual context c, returns an answer a (Fig. 1), given a training set of question-context-answer triplets. We focus on questions that require reasoning over the context c, e.g., composing two facts. To endow LMs with reasoning skills, we would like to automatically generate a large synthetic training set from semi-structured tables, one that is much larger than the target training set, before fine-tuning on a target dataset.

2.1 Generating Examples from Tables

We use tables from English Wikipedia (01-01-2020 dump) to generate synthetic examples. English Wikipedia includes millions of tables with high lexical and domain diversity Fetahu et al. (2019); Chen et al. (2020b); Gupta et al. (2020b); Talmor et al. (2021); Nan et al. (2021); Neeraja et al. (2021a). We first extract from Wikipedia all tables that have at least two columns and 10-25 rows, resulting in more than 700K tables. Then, we annotate all table columns with their semantic type (STRING, NUMBER, or DATE), which allows us to generate questions that involve manipulating numbers and dates. We annotate the semantic type of each column with standard tools for parsing dates (https://pypi.org/project/python-dateutil/) and numbers.
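The snippet below is a minimal sketch of how such column typing can be done with the standard parsers mentioned above (python-dateutil for dates, float parsing for numbers). The majority-vote threshold and helper names are illustrative assumptions, not the paper's exact rules.

```python
# Minimal sketch, under assumptions, of annotating column types
# (STRING / NUMBER / DATE) with standard parsers.
from dateutil import parser as date_parser

def parse_number(s):
    try:
        return float(s.replace(",", ""))
    except ValueError:
        return None

def parse_date(s):
    try:
        return date_parser.parse(s, fuzzy=False)
    except (ValueError, OverflowError):
        return None

def annotate_column_type(values):
    """Return NUMBER, DATE, or STRING for a list of cell strings."""
    n = max(len(values), 1)
    if sum(parse_number(v) is not None for v in values) / n > 0.5:
        return "NUMBER"
    if sum(parse_date(v) is not None for v in values) / n > 0.5:
        return "DATE"
    return "STRING"

print(annotate_column_type(["34,178", "33,861", "9,789"]))            # NUMBER
print(annotate_column_type(["6 November 1990", "27 February 1991"]))  # DATE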

EG: 2/3-hop Composition
Template: What was the col:1(s) when the col:2 was val:2 in table-title of page-title?
Example: "What was the Play(s) when the Author was William Shakespeare in Notable works of Lidia Zamkow?"

EG: Conjunction
Template: What was the col:1 when the col:2 was val:2 and the col:3 was val:3 in table-title of page-title?
Example: "What was the common name when the family was Picidae and the distribution was Okinawa in List of species of List of endemic birds of Japan?"

EG: Quantifiers Only
Template: Is val:1 the only col:1 that has col:2 val:2 in table-title of page-title?
Example: "Is Jean Philippe the only Artist that has Language French in Results of Eurovision Song Contest 1959?"

EG: Quantifiers Every/Most
Template: In table-title of page-title, does [OPERATOR] col:1 have col:2 val:2?
Example: "In Coal of List of Mines in South Africa, does every Mine have Owner Exxaro?"

EG: Num. Comparison
Template: In col:1 of table-title of page-title, which col:1 had [OPERATOR] col:2: val:2 or val:2?
Example: "In Administration of Mueang Nonthaburi District, which name had a higher population: Suan Yai or Bang Khen?"

EG: Temp. Comparison
Template: In col:1 of table-title of page-title, what happened [OPERATOR]: the col:1 was val:1 or the col:2 was val:2?
Example: "In Awards and nominations of Alexandre Pires, what happened earlier: the Category was Pop New Artist or the Category was Album of the Year?"

EG: Num. Boolean Comparison
Template: In col:1 of table-title of page-title, did val:1 have [OPERATOR] col:2 than val:1?
Example: "In Top employers of Chula Vista, California, did Walmart have more employees than Target?"

EG: Temp. Boolean Comparison
Template: The col:1 was val:1 [OPERATOR] the col:2 was val:2 in table-title of page-title?
Example: "The Referee was Salim Oussassi more recently than when the Referee was Rachid Medjiba in 1980 to 1999 of Algerian Cup Final referees?"

EG: Temp./Num. Superlatives
Template: In table-title of page-title, which col:1 has the [OPERATOR] col:2?
Example: "In List of graphic novels of Minx (comics), which title has the earliest release date?"

EG: Arithmetic Superlatives
Template: In table-title of page-title, what was the [OPERATOR] col:1 when the col:2 was val:2?
Example: "In By rocket of 1961 in spaceflight, what was the highest Successes when the Remarks was Maiden flight?"

EG: Counting
Template: How many col:1 have col:2 val:2 in table-title of page-title?
Example: "How many elections have candidate John Kufuor in Presidential elections of New Patriotic Party?"

EG: Arithmetic Addition
Template: In table-title of page-title, what was the total number of col:1 when the col:2 was val:2?
Example: "In Assists table of 2010-11 La Liga, what was the total number of assists when the club was Villarreal?"

EG: Date Difference
Template: In table-title of page-title, how much time had passed between when the col:1 was val:1 and when the col:2 was val:2?
Example: "In Notable events | Concerts of Candlestick Park, how much time had passed between when the Artist was Paul McCartney and when the Artist was The Beatles?"

Table 1: Question templates with examples for all EGs. Variable names specify permissible instantiations, where col is a column name, val is a value, and indices denote that a value must originate from a particular column. 2/3-hop composition examples are constructed by building 2- or 3-fact chains between the answer and the value in the question. For example, above, the chain includes the facts "The Role when the Author was Shakespeare was Lady Macbeth. The Play when the Role was Lady Macbeth was Macbeth". '[OPERATOR]' corresponds to EG-specific operators that we instantiate, e.g., in the EG 'Temp. Comparison', [OPERATOR] is replaced with 'earlier' or 'later'. Some EGs are collapsed into a single entry (e.g., Quantifiers Every/Most).

The core of the generation process is the set of example generators (EGs), each corresponding to a particular reasoning skill, such as composition, conjunction, etc. (see Table 1). Each example generator is a function that takes a table T and randomly samples ten (q, c, a) triplets from the set of all possible triplets, where (i) q is a question in pseudo-language, (ii) c is the context, i.e., a list of facts extracted from T that includes the gold facts necessary for answering q and distractor facts, all phrased in pseudo-language, and (iii) a is the answer. Overall, the synthetic training set is the union of the examples generated by all EGs over all tables.

EGs generate examples in the following way. Each EG is associated with one or more question templates, which differ in their surface phrasing. Templates contain typed variables that are instantiated with content from the table (see all variables in Table 1). Column and value variables are indexed to specify that the variable val:i must be instantiated by a value from the column col:i. In temporal/numerical templates, some column and value variables must have a DATE/NUMBER type, but we omit this notation for brevity. Instantiating all variables yields the question q, and the template allows us to programmatically compute the answer a. E.g., for the question from Fig. 1, "In League Cup of 1990-91 Chelsea F.C. season, which Round had a higher Attendance: QF or QFR?", the answer can be found by finding the rows with the values "QF" and "QFR" in the column "Round", and returning the value that has the higher number in the column "Attendance".

The context c is generated from the table content necessary for answering the question, which can be identified using the instantiated question template. Facts generally have the form "The col:1 when the col:2 was val:2 was val:1". For example, to answer the question above, we generate the gold facts "The Attendance when the Round was QF was 34,178" and "The Attendance when the Round was QFR was 33,861" using the relevant column names and values. We also generate distractor facts from rows or columns that are not relevant to the question, for example, "The Attendance when the Round was R4 was 9,789". Finally, we shuffle the gold facts and distractor facts.
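The sketch below illustrates this context-construction step: it renders gold and distractor facts in the pseudo-language pattern above and shuffles them. The table representation and function names are ours, not the released code.

```python
# Illustrative sketch of context construction: gold facts for the question plus
# distractor facts from irrelevant rows, rendered in the pseudo-language pattern
# described above and shuffled. Names are assumptions, not the released code.
import random

def make_fact(col1, col2, val2, val1):
    return f"The {col1} when the {col2} was {val2} was {val1}."

def build_context(table, col1, col2, gold_vals, num_distractors=2, seed=0):
    """gold_vals: the col2 values that the question refers to."""
    rng = random.Random(seed)
    i1, i2 = table["columns"].index(col1), table["columns"].index(col2)
    gold = [make_fact(col1, col2, r[i2], r[i1])
            for r in table["rows"] if r[i2] in gold_vals]
    other_rows = [r for r in table["rows"] if r[i2] not in gold_vals]
    distractors = [make_fact(col1, col2, r[i2], r[i1])
                   for r in rng.sample(other_rows, min(num_distractors, len(other_rows)))]
    facts = gold + distractors
    rng.shuffle(facts)  # gold and distractor facts are shuffled, as described above
    return " ".join(facts)

toy = {"columns": ["Round", "Attendance"],
       "rows": [["QF", "34,178"], ["QFR", "33,861"], ["R4", "9,789"], ["R3", "16,699"]]}
print(build_context(toy, "Attendance", "Round", {"QF", "QFR"}))
```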

Overall, our process results in a large synthetic training set, which includes examples that require reasoning from 16 EGs (all shown in Table 1).

2.2 Data Analysis

The data generation process yields 4.8M questions from 176K tables; their main statistics are in Table 2. The number of distinct words and word pieces is very large (850K and 27K respectively), illustrating the wide coverage and high lexical diversity of our approach. Moreover, generated examples have diverse answer types, which include extracting spans from the context, yes/no questions, numeric answers, and date answers. In addition, by leveraging the distribution of Wikipedia tables, our questions cover a wide range of domains including popular culture, geography, politics, and science. Specifically, tables cover more than 2,500 different Wikipedia categories, with 150 categories covering 80% of the data. We show the most frequent categories in §A.1.

3 Training

Since our EGs generate large quantities of examples, one can think of each EG as providing an infinite stream of examples. In this setup, a natural question is how to construct training batches such that the model learns the required skills as quickly as possible. After briefly describing our model, we will detail our training framework, where we sample examples from EGs in an error-driven manner.

Model

We use a standard encoder-decoder architecture Raffel et al. (2020); Lewis et al. (2020). Given a training example (q, c, a), the model takes as input the concatenation of the question q and context c, and the task is to autoregressively decode the answer a token-by-token. We train to maximize the log-likelihood of the answer, log p(a | q, c).
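As a concrete illustration, the following sketch computes this objective for a single synthetic example with the Hugging Face T5 implementation that the appendix says our code builds on; the exact way the question and context are concatenated here is an assumption.

```python
# A minimal sketch of the training objective on one synthetic example, using
# Hugging Face T5 (see Appendix A.3). The input formatting is an assumption.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

question = "In League Cup of 1990-91 Chelsea F.C. season, which Round had a higher Attendance: QF or QFR?"
context = ("The Attendance when the Round was QF was 34,178. "
           "The Attendance when the Round was QFR was 33,861.")
answer = "QF"

inputs = tokenizer(f"{question} {context}", return_tensors="pt", truncation=True)
labels = tokenizer(answer, return_tensors="pt").input_ids

# The forward pass returns the token-level cross-entropy loss, i.e. the negative
# log-likelihood of the answer given the question and context.
loss = model(**inputs, labels=labels).loss
loss.backward()  # one gradient step of the likelihood objective
print(float(loss))
```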

3.1 Multi-task Training over Reasoning Skills

Given a pre-trained LM, we add another pre-training step, where we multi-task over a set of tasks, each corresponding to examples generated from a single EG. Similar to past work Yogatama et al. (2019); Geva et al. (2020), to avoid "catastrophic forgetting" Kirkpatrick et al. (2016) of the language skills acquired during pre-training, we sample batches from the original pre-training task with a fixed probability.

Past work Gottumukkala et al. (2020) has shown that heterogeneous batching, i.e., having examples from all tasks in each batch, leads to better performance compared to having entire batches from a single task. We follow this practice, and in every batch sample examples from every task according to a probability distribution P over tasks. The main question is how to determine the distribution P, which we turn to next.
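A minimal sketch of heterogeneous batch construction is shown below: per-task example counts are drawn from the sampling distribution P so that every batch mixes all tasks. The stream interface and task names are illustrative, not the actual codebase.

```python
# Sketch of heterogeneous batching: every batch mixes examples from all tasks,
# with per-task counts drawn from the task sampling distribution P.
import itertools
import numpy as np

def sample_heterogeneous_batch(task_streams, probs, batch_size, rng):
    """task_streams: task name -> (infinite) iterator over examples.
    probs: task name -> sampling probability."""
    tasks = list(task_streams)
    p = np.array([probs[t] for t in tasks], dtype=float)
    counts = rng.multinomial(batch_size, p / p.sum())  # how many examples per task
    batch = []
    for task, k in zip(tasks, counts):
        batch.extend(next(task_streams[task]) for _ in range(k))
    rng.shuffle(batch)
    return batch

rng = np.random.default_rng(0)
streams = {t: itertools.cycle([f"<{t} example>"]) for t in ["composition", "counting"]}
print(sample_heterogeneous_batch(streams, {"composition": 0.7, "counting": 0.3}, 8, rng))
```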

Measurement Value
# Distinct questions 4.8M
# Distinct tables 176K
# Distinct pages 130K
Avg. question length (words) 19.3 ± 4.2
Avg. context length (words) 111.3 ± 44.8
Avg. # of gold facts 4.4 ± 4.7
Avg. # of distractor facts 5.0 ± 2.8
# Distinct words 850,543
# Distinct word-pieces 27,055
% Span answer 43.2
% Yes/no answer 31.6
% Numeric answer 15.8
% Date answer 9.4
Table 2: Key statistics for the generated data (mean ± standard deviation where applicable).

3.2 Sampling Strategies

We describe strategies for computing the task distribution P, starting with the commonly-used uniform sampling approach, and then turn to error-driven approaches.

Uniform sampling

Past work Khashabi et al. (2020); Raffel et al. (2020); Wang et al. (2020) used uniform sampling, where the probability of sampling from each task is uniform, as a priori all tasks are equally important. Other approaches sample examples in proportion to the size of the training set Raffel et al. (2020); Wang et al. (2020). This is not applicable in our case, where we assume an infinite stream of examples for every task, and also make no assumptions on the distribution over reasoning skills in the downstream test set.

Error sampling

Recent work Sharma et al. (2018); Gottumukkala et al. (2020) has proposed to construct P based on model errors, where one over-samples tasks on which the error rate is high. More formally, let Ceil_i be an estimate of the maximum accuracy achievable on a task t_i, and Acc_i be the current model accuracy for task t_i on a held-out set. We define Err_i = Ceil_i - Acc_i and P(t_i) ∝ Err_i. The distribution P is updated every time we evaluate the current model on the held-out data. In our setup, since we perform multi-task training over synthetic examples that are generated at scale, we assume that Ceil_i = 1 and hence: P(t_i) ∝ 1 - Acc_i.
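Under the Ceil_i = 1 assumption, the error-sampling distribution reduces to the few lines below (the accuracy numbers are invented for illustration).

```python
# Error sampling under the ceiling-accuracy = 1 assumption stated above:
# each task is sampled in proportion to its current held-out error rate.
def error_sampling_distribution(held_out_accuracy):
    errors = {task: 1.0 - acc for task, acc in held_out_accuracy.items()}
    total = sum(errors.values()) or 1.0  # avoid division by zero when all errors are 0
    return {task: err / total for task, err in errors.items()}

print(error_sampling_distribution(
    {"2-hop composition": 0.95, "arithmetic addition": 0.60, "counting": 0.90}))
```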

Momentum Sampling

An issue with error sampling is that if the error rate is high for a task and learning it is slow, the model will spend most of its time on that task at the expense of all other tasks, which may lead overall to low data efficiency. We empirically demonstrate this phenomenon at the bottom of this section. To remedy it, we introduce a new sampling strategy, termed momentum sampling.

In momentum sampling, we sample examples from a task in proportion to its rate of improvement, putting most probability mass on skills that are improving quickly. Alg. 1 provides the details of this strategy. Let t denote the index of a checkpoint evaluated on the held-out set, let w be a window size, and let Acc_i(t) be the held-out accuracy of checkpoint t on task i. We estimate model accuracy on a task at the beginning and end of the window, and sample examples in proportion to the difference in accuracy during that window (we use the absolute difference rather than the gain to account for cases of sudden drops in performance). To smooth out accuracy fluctuations in adjacent checkpoints, we estimate accuracy as an average over several adjacent model checkpoints. During the first w checkpoint evaluations, we simply use uniform sampling.

A desired property of momentum sampling is that when all tasks reach their ceiling accuracy, it converges to uniform sampling, unlike error sampling, which will over-sample from tasks whose ceiling accuracy is low. This is beneficial in cases where the ceiling accuracy varies across tasks. We illustrate this point empirically next.

Input: window size w, smoothing factor k, minimum share of examples per task ε, training time t.

1: for i = 1, ..., |T| do
2:     if t ≥ w + k then
3:         AccStart_i ← mean(Acc_i(t−w−k+1), ..., Acc_i(t−w))
4:         AccEnd_i ← mean(Acc_i(t−k+1), ..., Acc_i(t))
5:         P(task_i) ← max(|AccEnd_i − AccStart_i|, ε)
6:     else
7:         P(task_i) ← 1/|T|
8: normalize P so that it sums to 1
Algorithm 1 Momentum Sampling(w, k, ε)
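For readers who prefer code, the following Python sketch renders our reading of Algorithm 1; details such as exactly which checkpoints are averaged for smoothing are assumptions where the listing above is ambiguous.

```python
# Python rendering of momentum sampling as described above: sample each task
# in proportion to its smoothed accuracy change over a window of checkpoints,
# with a minimum share per task, and uniform sampling during warm-up.
def momentum_sampling_distribution(acc_history, window, smooth, min_share):
    """acc_history: task -> list of held-out accuracies, one entry per checkpoint."""
    tasks = list(acc_history)
    num_ckpts = min(len(v) for v in acc_history.values())
    if num_ckpts < window + smooth:
        return {t: 1.0 / len(tasks) for t in tasks}  # warm-up: uniform sampling
    weights = {}
    for t in tasks:
        accs = acc_history[t]
        start = sum(accs[-window - smooth:-window]) / smooth  # smoothed start of window
        end = sum(accs[-smooth:]) / smooth                    # smoothed end of window
        weights[t] = max(abs(end - start), min_share)         # difference, not gain
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}

history = {"arithmetic addition": [0.1, 0.2, 0.35, 0.5, 0.6],
           "counting": [0.9, 0.92, 0.93, 0.93, 0.93]}
print(momentum_sampling_distribution(history, window=3, smooth=1, min_share=0.01))
```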

Empirical comparison of sampling strategies

To highlight the benefits of momentum sampling, we show that when sampling from two tasks, where the labels for one of the tasks are noisy, momentum sampling outperforms error sampling. Specifically, we consider training on 2-hop composition and arithmetic addition, which is slower to train, in two conditions: (a) with gold labels for both tasks, and (b) when the labels for the 2-hop composition are randomly sampled from the vocabulary. We expect that when labels for 2-hop composition are random, this will lead to slow training of arithmetic addition when using error sampling, since most of the probability mass will be dedicated to 2-hop composition, which is impossible to learn.

Fig. 3 illustrates this phenomenon. Without noise (left), both momentum sampling and error sampling learn faster than uniform sampling. Momentum sampling learns more slowly than error sampling, due to the warm-start period in the first evaluated checkpoints. However, when 2-hop composition has random labels (right), error sampling puts most probability mass on 2-hop composition, and thus error sampling is even worse than uniform sampling, while momentum sampling performs best. Thus, momentum sampling outperforms uniform sampling in both cases.

Related work

Past work has considered error-driven data sampling in the context of active learning Sharma et al. (2018), reinforcement learning Graves et al. (2017); Glover and Hokamp (2019); Xu et al. (2019), transfer learning Zhang et al. (2020); Pilault et al. (2021), and distributionally robust optimization Oren et al. (2019); Sagawa et al. (2020), where the goal is to perform well over a family of distributions over the tasks. Similar to Gottumukkala et al. (2020), we compute P based on accuracy over a held-out set rather than the loss over the training data, as this corresponds directly to our target metric.

Figure 3: Motivation for momentum sampling. With the gold labels (left), error sampling and momentum sampling outperform uniform sampling on the arithmetic addition task by over-sampling the harder task. When 2-hop composition has random labels (right), error sampling over-samples the composition task and momentum sampling is best.

4 Experimental Setup

We now describe our experimental evaluation.

4.1 Models

Baselines

Our baseline is T5 Raffel et al. (2020), a popular pre-trained encoder-decoder model, which we fine-tune on the downstream datasets. We experiment with two model sizes, 220 million parameters (T5-Base), and 770 million parameters (T5-Large). When the answer is a list, we train our model to generate the list of values.

Our pre-trained for reasoning model, PReasM, is a T5 model to which we add a second step of pre-training on the synthetic data. Again, we experiment with Base and Large models and three sampling strategies: uniform sampling, error sampling, and momentum sampling; we name these models PReasM-Uni, PReasM-Err, and PReasM-Moment, accordingly.

4.2 Datasets

DROP

Dua et al. (2019) is an RC dataset with questions that require mathematical reasoning. As an additional baseline, we also compare to GenBERT Geva et al. (2020), which, similar to our approach, injects numerical skills by automatically generating synthetic data from a grammar.

IIRC

Ferguson et al. (2020) is a question answering dataset, where annotators were given a single Wikipedia paragraph and were asked to author questions that depend on that paragraph, but also on other paragraphs linked from the input paragraph, without observing those paragraphs. This resulted in questions that require discrete temporal or numeric reasoning. In addition, a fraction of the questions are unanswerable.

We experiment with IIRC in both an oracle and a retrieval setting. In the oracle setting, the model is given the gold context, which reduces the problem to reading comprehension, where we can apply our models. In the retrieval setting, we use the improved pipeline model introduced by Ni et al. (2021) to retrieve the relevant context, and then replace the NumNet+ (Base) reader Ran et al. (2019) used by the authors (which has a specialized architecture for numerical reasoning) with T5/PReasM.

MMQA

Talmor et al. (2021) is a question answering dataset, where the input is a question and a context that consists of a table, multiple text paragraphs, and multiple images, and the model must reason over a subset of the input modalities to answer the question. (We removed tables that appear in the MMQA development and test sets from our synthetic data.) Since our T5/PReasM models cannot handle images or very long contexts, we construct a pipeline that automatically directs some MMQA questions to T5/PReasM, and uses the original Implicit-Decomp baseline from Talmor et al. (2021) elsewhere.

The first classifier in this pipeline is a T5-Large model fine-tuned on the MMQA training set to determine whether a question is likely to require an image. When the classifier determines that a question requires an image, the example is directed to Implicit-Decomp. This classifier reaches high accuracy on the MMQA development set.

The second classifier in the pipeline is a T5-3B model, fine-tuned on the MMQA training set to determine, given a question and one of the textual paragraphs, whether that paragraph is required for answering the question. Then, for every question that does not require an image, we classify each of the textual paragraphs and only use the ones classified as relevant. This process identifies all gold paragraphs in the vast majority of examples. Again, we experiment with an oracle and a retrieval setting, such that in the oracle setting our model is presented with the gold text paragraphs.

Last, we convert the table into text by linearizing it as described in Talmor et al. (2021). The model is presented with multiple paragraphs and the linearized table, and can answer questions that require reasoning across them. Since the context is long, we present the model with contexts of 1,536 word-pieces (without any change to the original T5 model).
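For illustration, a rough sketch of table linearization is shown below; the exact delimiters used by Talmor et al. (2021) may differ, so treat the format as an assumption rather than the official one.

```python
# A rough sketch of linearizing a table into text for the encoder input.
# The row/cell delimiters are assumptions, not the official MMQA format.
def linearize_table(page_title, table_title, columns, rows):
    parts = [f"{table_title} of {page_title}."]
    for row in rows:
        cells = "; ".join(f"{col} is {val}" for col, val in zip(columns, row))
        parts.append(cells + ".")
    return " ".join(parts)

print(linearize_table(
    "1990-91 Chelsea F.C. season", "League Cup",
    ["Round", "Attendance"], [["QF", "34,178"], ["QFR", "33,861"]]))
```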

Table 3 contains the number of questions in the train, development, and test sets for each of our datasets. For MMQA, there are 15,688 train and 1,501 development examples that require reasoning over the table and text only.

Dataset # Train Questions # Development Questions # Test Questions
DROP 77,409 9,536 9,622
IIRC 10,839 1,301 1,301
MMQA 23,817 2,441 3,660
Table 3: Number of questions in each dataset.

Evaluation metrics

For all datasets, we use the F1 and EM scores defined for DROP Dua et al. (2019), and later used in IIRC and MMQA: given a gold and a predicted list of answers, items on the two lists are aligned, and then their strings are compared. We use the official evaluation scripts released by the dataset authors in all cases.
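To clarify the list-alignment idea behind this metric, the simplified sketch below aligns predicted and gold answer lists so as to maximize the summed token-level F1; it omits the official answer normalization rules and is no substitute for the released evaluation scripts.

```python
# Simplified sketch of the list-alignment idea behind the DROP-style metric.
from collections import Counter
from itertools import permutations

def token_f1(pred, gold):
    p, g = pred.lower().split(), gold.lower().split()
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

def list_f1(pred_list, gold_list):
    n = max(len(pred_list), len(gold_list))
    preds = pred_list + [""] * (n - len(pred_list))  # pad shorter list with empties
    golds = gold_list + [""] * (n - len(gold_list))
    best = max(sum(token_f1(p, g) for p, g in zip(perm, golds))
               for perm in permutations(preds))  # brute-force alignment (small lists)
    return best / n

print(list_f1(["the beatles", "paul mccartney"], ["Paul McCartney", "The Beatles"]))
```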

5 Experimental Results

We first present results on the downstream RC datasets (§5.1) and then examine performance directly on the synthetic data (§5.2).

5.1 Performance on RC Datasets

Table 4 presents the results of our Large models over all datasets, in comparison to the current state-of-the-art. We observe that PReasM substantially improves performance compared to T5 in all conditions, improving on the test sets by 7.6, 7.9, 4.1, and 1.2 F1 points on DROP, IIRC (oracle), IIRC (retrieval), and MMQA, respectively. We obtain new state-of-the-art results on MMQA and IIRC. On IIRC, we improve performance when using the same retriever (Pipeline) and replacing the NumNet+ reader with PReasM. (We report the official numbers from Ni et al. (2021). To fairly compare with the NumNet+ reader, we obtained the retrieved paragraphs for the Pipeline model through personal communication; results on these paragraphs were lower than reported in the paper. The reported results of our models use this slightly worse retriever, which still outperforms NumNet+ (Pipeline) as reported in the original paper.) On DROP, specialized architectures for handling numbers still substantially outperform both T5 and PReasM.

Table 5 shows the effect of different sampling strategies when training the PReasM model. We observe that error sampling and momentum sampling generally outperform uniform sampling, but do not observe a clear advantage to momentum sampling compared to error sampling. We further analyze the effects of momentum sampling and error sampling when pre-training on in §5.2.

We now look in detail at the performance of models on different answer types across datasets, where we observe that PReasM leads to dramatic improvements on some types, while maintaining similar performance on other types.

Dataset Model Development Test
DROP T5-Large 64.6/61.8 65.0/61.8
DROP PReasM-Large 72.3/69.4 72.6/69.5
DROP GenBERT 72.3/68.8 72.4/68.6
DROP QDGAT-ALBERT - 90.1/87.0
IIRC (oracle) T5-Large 69.9/64.9 67.1/62.7
IIRC (oracle) PReasM-Large 77.4/72.7 75.0/70.6
IIRC (oracle) NumNet+ 69.2/63.9 70.3/65.6
IIRC (retrieval) T5-Large (Pipeline) 47.4/44.2 41.0/37.8
IIRC (retrieval) PReasM-Large (Pipeline) 50.0/46.5 45.1/42.0
IIRC (retrieval) NumNet+ (Pipeline) 45.8/41.7 44.3/41.3
IIRC (retrieval) NumNet+ (Joint) 50.6/46.9 50.5/47.4
MMQA T5-Large 64.3/57.9 63.4/57.0
MMQA PReasM-Large 65.5/59.0 64.6/58.3
MMQA Implicit-Decomp 55.5/48.8 55.9/49.3
Table 4: Development and test results. The two values in each cell indicate F1/EM.
Model DROP IIRC (oracle) IIRC (retrieval) MMQA
T5-Base 55.4 65.8 43.4 61.4
PReasM-Uni-Base 67.4 72.9 47.3 62.7
PReasM-Moment-Base 67.7 73.4 46.8 62.5
PReasM-Err-Base 68.0 74.3 47.0 62.6
T5-Large 64.6 69.7 47.0 64.2
PReasM-Uni-Large 71.4 75.0 49.1 64.9
PReasM-Moment-Large 71.7 76.8 49.8 64.9
PReasM-Err-Large 72.2 76.1 49.2 65.3
Table 5: F1 on the development sets with different sampling strategies. Results are averaged over 3 seeds.

DROP

PReasM outperforms T5 by a large margin for both the Base and Large models (see Table 5). Table 6 breaks down performance based on answer types, where we see that PReasM outperforms T5 across all model sizes and answer types by a wide margin.

Comparing PReasM-Base to GenBERT, which is also a Base-size model, we find that PReasM outperforms GenBERT on 3 of the 4 answer types. The high performance of GenBERT on Number questions can be explained by several factors: (a) GenBERT uses digit tokenization, which improves arithmetic reasoning Thawani et al. (2021), and (b) GenBERT is trained on multiple templates that focus on numerical reasoning. Training PReasM on more numeric data generated from the grammar of GenBERT is likely to lead to further improvements.

Model Span Spans Date Number Total
T5-Base 77.5 65.8 57.1 43.7 55.8
PReasM-Base 81.1 69.4 64.6 61.5 68.1
T5-Large 86.1 78.4 75.7 52.2 64.6
PReasM-Large 86.6 78.4 77.7 64.4 72.3
GenBERT 74.5 24.2 56.4 75.2 72.3
Table 6: Development F1 on DROP with answer type breakdown.

IIRC

Table 7 breaks down performance based on answer types. Again, PReasM outperforms T5 in the oracle setup, by roughly 8 F1 points for both Base and Large models, and by 2.6-4.0 F1 points in the retrieval setup. Improvements are mostly due to cases where the answer is a numerical Value, where PReasM outperforms T5 by 39.1 and 40.3 F1 points for Base and Large models (oracle setup).

Comparing PReasM-Base to NumNet+, we find that PReasM outperforms NumNet+ on None, Span and Binary questions, but has lower performance on Value questions, where NumNet+ uses specialized architecture.

Uniform sampling slightly outperforms error-driven sampling in the Base model on IIRC (Table 5). Analyzing answer types, we find that error-driven sampling improves performance on Value questions, but reduces performance on None questions, leading overall to a slight advantage for uniform sampling. This effect disappears in Large models, where error-driven sampling outperforms uniform sampling.

Overall, PReasM-Large improves the state-of-the-art in the oracle setup by a wide margin. In the retrieval setting, PReasM outperforms NumNet+ (Pipeline) by 4.2 and 0.8 F1 points on the development and test sets, respectively.

Model (oracle setting) None Span Binary Value Total
T5-Base 91.4 72.0 76.6 8.7 66.3
PReasM-Base 92.5 74.9 71.9 47.8 74.5
T5-Large 92.2 77.7 81.3 10.9 69.9
PReasM-Large 92.2 78.4 80.5 51.2 77.4
Model (retrieval setting) None Span Binary Value Total
T5-Base 57.1 47.6 54.7 6.7 43.5
PReasM-Base 53.9 49.1 64.8 24.3 47.5
T5-Large 56.2 49.9 77.3 11.5 47.4
PReasM-Large 55.9 50.8 69.5 28.6 50.0
NumNet+ (Pipeline) 49.6 48.4 52.3 30.0 45.8
Table 7: Development F1 on IIRC with answer type breakdown. The top block uses the oracle setting (gold context); the bottom block uses the retrieval setting.

MMQA

Model Oracle ColumnHop Text Composition Comparison Conjunction Yes/No Aggregate Total
T5-Base 81.7 75.2 67.0 61.8 74.1 76.9 27.3 71.9
PReasM-Base 80.8 75.7 66.3 80.8 80.8 83.1 36.4 74.3
T5-Large 82.6 79.8 71.8 69.3 83.0 83.1 27.3 76.8
PReasM-Large 84.0 79.7 71.9 81.0 82.3 93.8 36.4 78.4
T5-Base 85.2 82.1 74.6 63.3 77.4 80.0 27.3 77.9
PReasM-Base 86.9 80.0 75.4 84.1 82.6 89.2 36.4 79.9
T5-Large 88.2 85.9 79.4 74.1 83.2 83.1 36.4 82.7
PReasM-Large 87.8 85.6 79.8 83.6 82.3 90.8 45.5 83.8
Implicit-Decomp 96.6 57.1 53.2 78.4 68.1 76.9 59.1 62.3
Table 8: F1 on the MMQA development set with reasoning type breakdown.

Table 8 breaks down model performance based on reasoning skill, which is annotated for every example in MMQA. PReasM outperforms T5 in both the oracle and retrieval setting, and for both model sizes.

We observe that the main source of improvement is comparison questions, where PReasM outperforms T5 by a wide margin for both Base and Large models. Second, PReasM outperforms T5 on questions that require conjunction in Base models, and on yes/no questions in all settings. Interestingly, T5 is equipped with decent composition skills based only on its original pre-training, without any specialized pre-training.

Comparing our models to Implicit-Decomp, we find that although Implicit-Decomp outperforms our models on questions that require hopping between two table columns and on aggregations (there are only 11 aggregation questions in the development set), PReasM outperforms Implicit-Decomp in all other cases. When considering only questions that require reasoning over text and tables, PReasM-Large improves F1 over Implicit-Decomp by a wide margin (Table 8).

5.2 Performance on the Synthetic Data

Figure 4: Minimum and average task accuracy over a held-out set of synthetic examples (left and center), and the entropy of the task sampling distribution P (right), for Large models, as a function of the number of training steps for all sampling strategies.

Fig. 4 shows statistics on the performance of PReasM on the different synthetic tasks during training. The average accuracy across all 16 tasks at the end of training is high, at almost 98.0. We observe that PReasM reaches high performance on all tasks, where the lowest-performing tasks are 'arithmetic addition' and 'date difference', whose accuracy remains lower than the rest at the end of training. On those tasks, the advantage of error-driven sampling is evident, and it outperforms uniform sampling by as much as 4 points. We provide full results over the synthetic data, including the performance of T5 in a few-shot setting, in §A.4.

Zooming in on the learning curves, we see that momentum and error sampling learn reasoning skills much faster than uniform sampling. Looking at the entropy of the sampling distribution P sheds light on the difference between error sampling and momentum sampling. Error sampling tends to put most probability mass on the lowest-performing task, namely arithmetic addition, and thus its entropy over tasks is roughly constant from a certain point in training. Conversely, momentum sampling puts a lot of probability mass on tasks that are improving quickly at the beginning of training, but as improvements plateau, it converges towards uniform sampling.

5.3 Analyzing Reasoning Skills in DROP

Question Type NMN T5-Base PReasM-Base T5-Large PReasM-Large
Date-Compare 82.6 86.4 87.5 87.6 89.9
Date-Difference 75.4 19.6 78.9 45.4 80.4
Number-Compare 92.7 91.3 95.2 97.3 98.5
Extract-Number 86.1 91.8 94.9 92.1 95.1
Count 55.7 80.1 86.7 86.7 89.2
Extract-Argument 69.7 87.6 86.2 90.5 92.1
Table 9: F1 on a previously-proposed split of a subset of the DROP development set into reasoning skills.

To check which reasoning skills PReasM has acquired, we use a previously-proposed split of a subset of DROP into reasoning skills Gupta et al. (2020a). Table 9 presents the F1 results for our best PReasM and T5 models on this split, as well as the F1 results of the neural module network (NMN) used in Gupta et al. (2020a). We note that NMN was trained only on a subset of the original DROP dataset. When comparing to T5, we find that PReasM dramatically improves performance on Date-Difference, and also leads to sizable gains on Number-Compare, Extract-Number, and Count. In addition, PReasM outperforms NMN on all reasoning skills.

6 Related Work

Data augmentation

Data augmentation techniques have been extensively explored in reading comprehension, question answering, and dialogue Feng et al. (2021), mainly by transfer learning Talmor and Berant (2019); Khashabi et al. (2020) and synthetic data generation Yu et al. (2018); Zhao et al. (2019); Alberti et al. (2019); Rozen et al. (2019); Campagna et al. (2020); Chen et al. (2020c); Asai and Hajishirzi (2020); Andreas (2020); Puri et al. (2020); Asai et al. (2020); Geva et al. (2020); Yang et al. (2021); Bartolo et al. (2021). Here we focus on semi-structured data as a valuable resource for data generation.

Pre-training over semi-structured data

Past work on pre-training over tables focused on reasoning over tables and knowledge-bases Eisenschlos et al. (2020); Yin et al. (2020); Herzig et al. (2020); Müller et al. (2021); Yu et al. (2021); Neeraja et al. (2021b), while we focus on reasoning over text. Recently, a dataset that focuses on reasoning over synthetic textual facts, generated by a LM from a knowledge graph, was introduced (arXiv:2106.01074).

7 Conclusion

In this work, we propose semi-structured tables as a valuable resource for automatically generating, at scale, examples that can endow pre-trained language models with reasoning skills. We generate almost 5M examples that correspond to 16 reasoning skills from Wikipedia tables and add a second pre-training step over this data. To improve data efficiency, we use error-driven sampling, which focuses training on reasoning skills that the model currently lacks.

We evaluate our model, PReasM, on three reasoning-focused RC datasets and show that it leads to substantial improvements in all cases. Moreover, we thoroughly analyze the performance of PReasM and show that our approach dramatically improves performance on questions that require reasoning skills that were not acquired during the original pre-training, while maintaining comparable performance on other question types.

Acknowledgments

We thank Elad Segal and Ankit Gupta for their useful comments and James Ferguson, Ansong Ni, and Matt Gardner for their help with the IIRC dataset. This research was partially supported by The Yandex Initiative for Machine Learning, and the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant ERC DELPHI 802800).

References

  • C. Alberti, D. Andor, E. Pitler, J. Devlin, and M. Collins (2019) Synthetic QA corpora generation with roundtrip consistency. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 6168–6173. External Links: Document, Link Cited by: §1, §6.
  • J. Andreas (2020) Good-enough compositional data augmentation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 7556–7566. External Links: Document, Link Cited by: §1, §6.
  • A. Asai and H. Hajishirzi (2020) Logic-guided data augmentation and regularization for consistent question answering. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 5642–5650. External Links: Document, Link Cited by: §1, §6.
  • A. Asai, K. Hashimoto, H. Hajishirzi, R. Socher, and C. Xiong (2020) Learning to retrieve reasoning paths over wikipedia graph for question answering. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, External Links: Link Cited by: §6.
  • M. Bartolo, T. Thrush, R. Jia, S. Riedel, P. Stenetorp, and D. Kiela (2021) Improving question answering model robustness with synthetic adversarial data generation. External Links: 2104.08678 Cited by: §1, §6.
  • T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei (2020) Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.), External Links: Link Cited by: §1.
  • G. Campagna, A. Foryciarz, M. Moradshahi, and M. Lam (2020) Zero-shot transfer learning with synthesized data for multi-domain dialogue state tracking. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 122–132. External Links: Document, Link Cited by: §1, §6.
  • K. Chen, W. Xu, X. Cheng, Z. Xiaochuan, Y. Zhang, L. Song, T. Wang, Y. Qi, and W. Chu (2020a) Question directed graph attention network for numerical reasoning over text. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 6759–6768. External Links: Document, Link Cited by: §1.
  • W. Chen, H. Wang, J. Chen, Y. Zhang, H. Wang, S. Li, X. Zhou, and W. Y. Wang (2020b) TabFact: A large-scale dataset for table-based fact verification. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, External Links: Link Cited by: §2.1.
  • X. Chen, C. Liang, A. W. Yu, D. Zhou, D. Song, and Q. V. Le (2020c) Neural symbolic reader: scalable integration of distributed and symbolic representations for reading comprehension. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, External Links: Link Cited by: §6.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. External Links: Document, Link Cited by: §1.
  • D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner (2019) DROP: a reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 2368–2378. External Links: Document, Link Cited by: §1, §4.2, §4.2.
  • J. Eisenschlos, S. Krichene, and T. Müller (2020) Understanding tables with intermediate pre-training. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online, pp. 281–296. External Links: Document, Link Cited by: §1, §6.
  • S. Y. Feng, V. Gangal, J. Wei, S. Chandar, S. Vosoughi, T. Mitamura, and E. Hovy (2021) A survey of data augmentation approaches for NLP. External Links: 2105.03075 Cited by: §6.
  • J. Ferguson, M. Gardner, H. Hajishirzi, T. Khot, and P. Dasigi (2020) IIRC: a dataset of incomplete information reading comprehension questions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 1137–1147. External Links: Document, Link Cited by: §1, §4.2.
  • B. Fetahu, A. Anand, and M. Koutraki (2019) TableNet: an approach for determining fine-grained relations for wikipedia tables. In The World Wide Web Conference, WWW 2019, San Francisco, CA, USA, May 13-17, 2019, L. Liu, R. W. White, A. Mantrach, F. Silvestri, J. J. McAuley, R. Baeza-Yates, and L. Zia (Eds.), pp. 2736–2742. External Links: Document, Link Cited by: §2.1.
  • M. Geva, A. Gupta, and J. Berant (2020) Injecting numerical reasoning skills into language models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 946–958. External Links: Document, Link Cited by: §1, §3.1, §4.2, §6.
  • J. Glover and C. Hokamp (2019) Task selection policies for multitask learning. External Links: 1907.06214 Cited by: §3.2.
  • A. Gottumukkala, D. Dua, S. Singh, and M. Gardner (2020) Dynamic sampling strategies for multi-task reading comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 920–924. External Links: Document, Link Cited by: §1, §3.1, §3.2.
  • A. Graves, M. G. Bellemare, J. Menick, R. Munos, and K. Kavukcuoglu (2017) Automated curriculum learning for neural networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, D. Precup and Y. W. Teh (Eds.), Proceedings of Machine Learning Research, Vol. 70, pp. 1311–1320. External Links: Link Cited by: §3.2.
  • N. Gupta, K. Lin, D. Roth, S. Singh, and M. Gardner (2020a) Neural module networks for reasoning over text. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, External Links: Link Cited by: §1, §5.3.
  • V. Gupta, M. Mehta, P. Nokhiz, and V. Srikumar (2020b) INFOTABS: inference on tables as semi-structured data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 2309–2324. External Links: Document, Link Cited by: §2.1.
  • J. Herzig, P. K. Nowak, T. Müller, F. Piccinno, and J. Eisenschlos (2020) TaPas: weakly supervised table parsing via pre-training. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 4320–4333. External Links: Document, Link Cited by: §1, §6.
  • C. Hidey, T. Chakrabarty, T. Alhindi, S. Varia, K. Krstovski, M. Diab, and S. Muresan (2020) DeSePtion: dual sequence prediction and adversarial examples for improved fact-checking. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 8593–8606. External Links: Document, Link Cited by: §1.
  • D. Khashabi, S. Min, T. Khot, A. Sabharwal, O. Tafjord, P. Clark, and H. Hajishirzi (2020) UNIFIEDQA: crossing format boundaries with a single QA system. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online, pp. 1896–1907. External Links: Document, Link Cited by: §3.2, §6.
  • T. Khot, D. Khashabi, K. Richardson, P. Clark, and A. Sabharwal (2021) Text modular networks: learning to decompose tasks in the language of existing models. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, pp. 1264–1279. External Links: Link Cited by: §1.
  • J. Kirkpatrick, R. Pascanu, N. C. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, D. Hassabis, C. Clopath, D. Kumaran, and R. Hadsell (2016) Overcoming catastrophic forgetting in neural networks. CoRR abs/1612.00796. External Links: 1612.00796, Link Cited by: §A.3, §3.1.
  • M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer (2020) BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 7871–7880. External Links: Document, Link Cited by: §3.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized bert pretraining approach. External Links: 1907.11692 Cited by: §1.
  • T. Müller, J. M. Eisenschlos, and S. Krichene (2021) TAPAS at semeval-2021 task 9: reasoning over tables with intermediate pre-training. External Links: 2104.01099 Cited by: §6.
  • L. Nan, C. Hsieh, Z. Mao, X. V. Lin, N. Verma, R. Zhang, W. Kryscinski, N. Schoelkopf, R. Kong, X. Tang, M. Mutuma, B. Rosand, I. Trindade, R. Bandaru, J. Cunningham, C. Xiong, and D. R. Radev (2021) FeTaQA: free-form table question answering. CoRR abs/2104.00369. External Links: 2104.00369, Link Cited by: §2.1.
  • J. Neeraja, V. Gupta, and V. Srikumar (2021a) Incorporating external knowledge to enhance tabular reasoning. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, pp. 2799–2809. External Links: Link Cited by: §2.1.
  • J. Neeraja, V. Gupta, and V. Srikumar (2021b) Incorporating external knowledge to enhance tabular reasoning. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, pp. 2799–2809. External Links: Link Cited by: §6.
  • Y. Oren, S. Sagawa, T. Hashimoto, and P. Liang (2019) Distributionally robust language modeling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 4227–4237. External Links: Document, Link Cited by: §3.2.
  • J. Pilault, A. E. hattami, and C. Pal (2021) Conditionally adaptive multi-task learning: improving transfer learning in NLP using fewer parameters & less data. In International Conference on Learning Representations, External Links: Link Cited by: §3.2.
  • R. Puri, R. Spring, M. Shoeybi, M. Patwary, and B. Catanzaro (2020) Training question answering models from synthetic data. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 5811–5826. External Links: Document, Link Cited by: §1, §6.
  • C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21 (140), pp. 1–67. External Links: Link Cited by: §A.3, §1, §1, §3, §3.2, §4.1.
  • O. Ram, Y. Kirstain, J. Berant, A. Globerson, and O. Levy (2021) Few-shot question answering by pretraining span selection. In Association for Computational Linguistics (ACL), Cited by: Figure 6.
  • Q. Ran, Y. Lin, P. Li, J. Zhou, and Z. Liu (2019) NumNet: machine reading comprehension with numerical reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 2474–2484. External Links: Document, Link Cited by: §1, §4.2.
  • O. Rozen, V. Shwartz, R. Aharoni, and I. Dagan (2019) Diversify your datasets: analyzing generalization via controlled variance in adversarial datasets. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), Hong Kong, China, pp. 196–205. External Links: Document, Link Cited by: §1, §6.
  • S. Sagawa, P. W. Koh, T. B. Hashimoto, and P. Liang (2020) Distributionally robust neural networks. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, External Links: Link Cited by: §3.2.
  • S. Sharma, A. K. Jha, P. Hegde, and B. Ravindran (2018) Learning to multi-task by active sampling. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, External Links: Link Cited by: §3.2, §3.2.
  • A. Talmor and J. Berant (2019) MultiQA: an empirical investigation of generalization and transfer in reading comprehension. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4911–4921. External Links: Document, Link Cited by: §6.
  • A. Talmor, Y. Elazar, Y. Goldberg, and J. Berant (2020) OLMpics-on what language model pre-training captures. Transactions of the Association for Computational Linguistics 8, pp. 743–758. External Links: Document, Link Cited by: §1.
  • A. Talmor, J. Herzig, N. Lourie, and J. Berant (2019) CommonsenseQA: a question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4149–4158. External Links: Document, Link Cited by: §1.
  • A. Talmor, O. Yoran, A. Catav, D. Lahav, Y. Wang, A. Asai, G. Ilharco, H. Hajishirzi, and J. Berant (2021) MultiModalQA: complex question answering over text, tables and images. In International Conference on Learning Representations, External Links: Link Cited by: §1, §2.1, §4.2.
  • A. Thawani, J. Pujara, F. Ilievski, and P. Szekely (2021) Representing numbers in NLP: a survey and a vision. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, pp. 644–656. External Links: Link Cited by: §5.1.
  • E. Wallace, Y. Wang, S. Li, S. Singh, and M. Gardner (2019) Do NLP models know numbers? probing numeracy in embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 5307–5315. External Links: Document, Link Cited by: §1.
  • X. Wang, Y. Tsvetkov, and G. Neubig (2020) Balancing training for multilingual neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 8526–8537. External Links: Document, Link Cited by: §3.2.
  • A. Warstadt, Y. Cao, I. Grosu, W. Peng, H. Blix, Y. Nie, A. Alsop, S. Bordia, H. Liu, A. Parrish, S. Wang, J. Phang, A. Mohananey, P. M. Htut, P. Jeretic, and S. R. Bowman (2019) Investigating BERT’s knowledge of language: five analysis methods with NPIs. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 2877–2887. External Links: Document, Link Cited by: §1.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush (2020) HuggingFace’s transformers: state-of-the-art natural language processing. External Links: 1910.03771 Cited by: §A.3.
  • Y. Xu, X. Liu, Y. Shen, J. Liu, and J. Gao (2019) Multi-task learning with sample re-weighting for machine reading comprehension. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 2644–2655. External Links: Document, Link Cited by: §3.2.
  • P. Yang, Y. T. Chen, Y. Chen, and D. Cer (2021) NT5?! training t5 to perform numerical reasoning. External Links: 2104.07307 Cited by: §6.
  • P. Yin, G. Neubig, W. Yih, and S. Riedel (2020) TaBERT: pretraining for joint understanding of textual and tabular data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 8413–8426. External Links: Document, Link Cited by: §1, §6.
  • D. Yogatama, C. de Masson d’Autume, J. Connor, T. Kocisky, M. Chrzanowski, L. Kong, A. Lazaridou, W. Ling, L. Yu, C. Dyer, and P. Blunsom (2019) Learning and evaluating general linguistic intelligence. External Links: 1901.11373 Cited by: §3.1.
  • A. W. Yu, D. Dohan, Q. Le, T. Luong, R. Zhao, and K. Chen (2018) Fast and accurate reading comprehension by combining self-attention and convolution. In International Conference on Learning Representations, External Links: Link Cited by: §6.
  • T. Yu, C. Wu, X. V. Lin, bailin wang, Y. C. Tan, X. Yang, D. Radev, richard socher, and C. Xiong (2021) GraPPa: grammar-augmented pre-training for table semantic parsing. In International Conference on Learning Representations, External Links: Link Cited by: §1, §6.
  • S. Zhang, X. Zhang, W. Zhang, and A. Søgaard (2020) Worst-case-aware curriculum learning for zero and few shot transfer. External Links: 2009.11138 Cited by: §3.2.
  • Z. Zhao, S. Zhu, and K. Yu (2019) Data augmentation with atomic templates for spoken language understanding. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 3637–3643. External Links: Document, Link Cited by: §1, §6.

Appendix A Supplemental Material

A.1 Data Generation

Table 10 contains the number of generated examples for every EG. During data generation, we randomly generate at most ten examples for each EG and table (cf. §2.1). Table 11 contains examples of generated (question, context, answer) triplets, including the full context. Fig. 5 presents the most common categories of the Wikipedia pages from which we scraped our tables.

EG # Questions
2-Hop composition 277,069
3-Hop composition 364,772
Conjunction 353,738
Only quantifier 522,071
Most quantifier 94,180
Every quantifier 16,693
Number comparison 410,749
Temporal comparison 453,499
Number boolean comparison 410,749
Temporal boolean comparison 470,694
Number superlatives 125,144
Temporal superlatives 80,884
Arithmetic superlatives 183,892
Arithmetic addition 86,969
Counting 484,471
Date difference 452,061
Total 4,787,635
Table 10: Number of examples generated by each EG.
EG: 3-hop Composition
Question: What was the Result(s) when the Round was R4 in League Cup of 1990-91 Chelsea F.C. season?
Context: In League Cup of 1990-91 Chelsea F.C. season: The attendance when the round was R2 1st Leg was 5,666. The result when the date was 6 November 1990 was 3-2. The date when the attendance was 34,669 was 27 February 1991. The attendance when the round was QF was 34,178. The date when the attendance was 34,074 was 24 February 1991. The date when the attendance was 16,085 was 6 November 1990. The attendance when the round was R3 was 16,699. The date when the attendance was 9,789 was 28 November 1990. The result when the date was 28 November 1990 was 2-1. The result when the date was 31 October 1990 was 0-0. The attendance when the round was QFR was 33,861. The result when the date was 16 January 1991 was 0-0. The attendance when the round was R4 was 9,789. The result when the date was 10 October 1990 was 4-1 (won 9-1 on agg). The date when the attendance was 5,666 was 26 September 1990.
Answer: 2-1

EG: Numerical Superlatives
Question: Which opponent has the highest attendance in League Cup of 1990-91 Chelsea F.C. season?
Context: In League Cup of 1990-91 Chelsea F.C. season: The attendances when the opponent was Tottenham Hotspur were 34,178 and 33,861. The attendances when the opponent was Sheffield Wednesday were 34,669 and 34,074. The attendance when the opponent was Oxford United was 9,789. The attendances when the opponent was Portsmouth were 16,699 and 16,085. The attendances when the opponent was Walsall were 5,666 and 10,037.
Answer: Sheffield Wednesday

Table 11: Examples of generated triplets, produced from the table shown in Fig. 1. The composition question requires chaining gold facts, while the numerical superlatives question requires reasoning over all the facts in the context.
Figure 5: The most frequent categories of our Wikipedia pages and their frequency.

A.2 Training

Error sampling

Alg. 2 provides our error sampling algorithm.

Input: training time t.

1: for i = 1, ..., |T| do
2:     P(task_i) ← 1 − Acc_i(t)
3: normalize P so that it sums to 1
Algorithm 2 Error Sampling(t)

Momentum Sampling

For momentum sampling, we use a fixed window size, smoothing factor, and minimum share of examples per task when training PReasM-Base and PReasM-Large.

A.3 Experimental Setup

Original pre-training task

In order to avoid catastrophic forgetting Kirkpatrick et al. (2016), we continue training with the span-corruption objective introduced in Raffel et al. (2020), over sequences of length 256 from the English Wikipedia.

Implementation details

We run all our experiments on a single RTX8000 (48GB) or RTX3090 (24GB) GPU. We use the T5 model from https://huggingface.co/transformers/model_doc/t5.html Wolf et al. (2020).

Experiment Size LR Batch Size GAS Epochs
PReasM Base 1e-4 64 1 50
PReasM Large 1e-4 18 4 36
DROP Base 1e-4 20 1 20
IIRC (oracle) Base 1e-4 20 1 60
IIRC (retrieval) Base 1e-4 20 1 60
MMQA Base 1e-4 6 3 20
DROP Large 5e-5 16 2 20
IIRC (oracle) Large 5e-5 16 2 60
IIRC (retrieval) Large 5e-5 16 2 60
MMQA Large 1e-4 2 16 10
Table 12: Hyper-parameters used in all experiments; LR and GAS refer to learning rate and gradient accumulation steps. In our PReasM experiments, an epoch refers to the number of training steps between evaluations (the exact step counts differ between Base and Large runs).

A.4 Experimental Results

Fig. 6 shows the results of T5 and PReasM on the synthetic tasks for both model sizes. T5-Large outperforms T5-Base on most tasks, suggesting that skills such as comparison and superlatives may have been picked up better during pre-training. However, on tasks such as date difference and arithmetic addition, the results of T5-Large are very low. Our PReasM models significantly outperform T5 on all tasks.

Figure 6: F1 for every task, for T5 and PReasM. The results for T5 were obtained by training in a few-shot manner on 32 examples for 200 steps, as suggested in Ram et al. (2021).