Table 1: Comparison of existing QA datasets.

| Dataset | Answer type |  |  |  |  |  |
|---|---|---|---|---|---|---|
| SQuAD 2 Rajpurkar et al. (2018) | Spans | ✗ | ✗ | ✗ | ✗ | ✗ |
| HotpotQA Yang et al. (2018) | Spans | ✓ | ✓ | ✗ | ✗ | ✗ |
| NewsQA Trischler et al. (2016) | Spans | ✗ | ✗ | ✓ | ✗ | ✗ |
| DROP Dua et al. (2019) | Numbers, entities | ✓ | ✓ | ✗ | ✗ | ✗ |
| NarrativeQA Kociský et al. (2018) | Free form | ✗ | ✗ | ✗ | ✗ | ✗ |
| CosmosQA Huang et al. (2019) | Multiple choice | ✗ | ✗ | ✗ | ✗ | ✗ |
| CommonsenseQA Talmor et al. (2018) | Multiple choice | ✗ | ✗ | ✗ | ✗ | ✗ |
| McTaco Zhou et al. (2019) | Multiple choice | ✗ | ✗ | ✗ | ✗ | ✓ |
Machine reading comprehension has become a crucial task in natural language understanding. To test reasoning and inference over natural language, we turn to the task of question answering (QA). Recently, various QA datasets have been created to test single-hop reasoning Rajpurkar et al. (2016); Trischler et al. (2016) and multi-hop reasoning Yang et al. (2018); Welbl et al. (2017); Dua et al. (2019); Talmor and Berant (2018). Current approaches have made substantial progress on natural language understanding and have driven major improvements in model architectures Seo et al. (2016); Devlin et al. (2019); Nie et al. (2019); Liu et al. (2019).
However, existing QA datasets are limited in that they focus on testing a QA system’s ability to find or locate answers in single or multiple paragraphs. For example, HotpotQA Yang et al. (2018) questions are designed to be answered given a set of paragraphs as context, and most can be answered by locating answers across multiple sentences. While this dataset does test a system’s ability to reason over multiple pieces of evidence, it does not examine temporal reasoning, i.e., the ability to leverage temporal information in articles.
Although textual data are often accompanied by time information (e.g., dates in news articles), this information is easily overlooked and is not the focus of existing datasets. For example, to answer the question “How long will Mexican asylum seekers be held in the US by April 2019?” in Figure 1, we need to figure out the temporal relations between retrieved articles, and subsequently the temporal trajectory of each relevant supporting fact. Finally, humans can infer the answer by combining the information from each article with the extracted temporal relations, as long as the question is indeed answerable given the information available.
In this work, we construct ForecastQA, a new dataset that assesses a model’s temporal reasoning ability: resolving time information, causal relations, temporal relations, and inferring based on past events. To answer questions in ForecastQA, a model must extract time-related information from articles, as well as the temporal trends of the evidence, while still demonstrating strong natural language understanding. ForecastQA is collected via a crowdsourcing effort based on news articles: workers are shown articles and asked to come up with yes-no or multiple-choice questions and to find supporting evidence for them. In total, crowd workers crafted 5,704 yes-no questions and 4,513 multiple-choice questions that test a model’s temporal reasoning ability. We also provide manually annotated articles for a subset of questions to test how helpful curated articles are.
In our experiments, we investigate the difficulty of ForecastQA given the gold articles from which the questions were made. We find that RoBERTa Liu et al. (2019) achieves 77.9% accuracy, suggesting that ForecastQA is a challenging yet valid dataset. We also design a method based on pre-trained language models to handle retrieved articles in an open-domain setting. The method achieves 61.6% accuracy on ForecastQA with articles retrieved by a BM25 retriever.
Q: What will help provide access to vital medicines in Africa by October 2019?
Choices: Infrastructure, New technologies (answer), Health workers, Basic medical care.

Article: “Drones carrying medicines and blood to Ghana’s rural millions” (4/24/19)
The world’s largest drone delivery network has been launched in Ghana, where it will have the capacity to dispense medicines and vaccines to millions of people even in the remotest corners of the west African country. The service will enable staff at 2,000 health centres to receive deliveries via a parachute drop within half an hour of texting in their orders.

Reasoning process: “The world’s largest drone delivery network” = one of the new technologies (commonsense world knowledge). Ghana is in Africa (commonsense world knowledge). “Dispense medicines and vaccines” = provide access to vital medicines (paraphrase). The drones were deployed in April, which will help provide vital medicines in October (temporal reasoning: inferring based on past events).
2 Related Work
In this section, we review related work on question answering datasets. We divide question answering datasets into three categories: (1) extractive question answering, (2) temporal question answering, and (3) commonsense question answering. We summarize the key features of a collection of recent datasets in Table 1.
Extractive Question Answering. Recently, many extractive question answering datasets have been produced, which ask single-hop questions Rajpurkar et al. (2016, 2018); Trischler et al. (2016) or multi-hop questions Yang et al. (2018). Although SQuAD Rajpurkar et al. (2016, 2018) and HotpotQA Yang et al. (2018) are both built from Wikipedia articles, their main difference lies in the number of passages needed to answer a question: SQuAD finds the answer within one paragraph, while HotpotQA requires multiple pieces of evidence from more than one paragraph. Both, however, answer questions by locating a span of text in the passage. Similarly, NewsQA Trischler et al. (2016) uses the same extraction method, but is based on CNN news articles. Compared to these tasks, our ForecastQA dataset requires temporal reasoning ability, i.e., reasoning over historical evidence from multiple news articles. In addition, our answer type is yes/no or one of several given choices.
Temporal Question Answering. There have been attempts to build temporal question answering datasets Jia et al. (2018a, b); Sun et al. (2018) that focus on analyzing temporal intent. Temporal intent is the detection of key context words, such as “before”, that express time dependencies between events, also known as temporal relations; it can also cover cause-and-effect relationships between events, i.e., causal relations. TempQuestions Jia et al. (2018a) and TEQUILA Jia et al. (2018b) are examples of datasets that utilize temporal intent; both decompose each question into two sub-questions to investigate temporal relations. In addition, Sun et al. (2018) propose event graphs built from the given text and use them to determine which event happened first. The aforementioned datasets concentrate only on analyzing questions that contain a temporal or causal relation. In contrast, ForecastQA requires more complex reasoning beyond temporal or causal relations.
Commonsense Question Answering. Most QA systems focus on factoid questions, while CommonsenseQA Talmor et al. (2018) asks questions designed to be answerable from commonsense knowledge alone, without additional information. Likewise, SocialIQA Sap et al. (2019) is a commonsense dataset about social situations; it reasons over social and emotional interactions, such as motivations, what happens next, and emotional reactions. In contrast, our ForecastQA dataset uses supporting evidence from relevant articles and may additionally require common sense to answer questions. Furthermore, MCTaco Zhou et al. (2019) deals with temporal questions that require temporal commonsense reasoning. The major difference from our dataset is that questions in MCTaco are answerable by common sense alone, whereas ours require complex reasoning that includes commonsense.
3 The ForecastQA Dataset
We now introduce our ForecastQA task and data collection. The goal of the ForecastQA task is to test temporal reasoning ability, which distinguishes it from existing QA datasets.
ForecastQA Task. The input of our task is a forecasting question q with an associated time frame t_q and a news article corpus C, while the output is yes/no or one of the given choices. The distinguishing feature of ForecastQA is that models have access only to articles a ∈ C whose release date t_a satisfies t_a < t_q, i.e., only past news articles are accessible. For example, given the question “How long will Mexican asylum seekers be held in the US by April 2019?” in Figure 1, models can access only articles released before April 2019. This setup makes our dataset more challenging and distinct from existing QA datasets.
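This access constraint amounts to a simple date filter over the corpus; a minimal sketch, in which the corpus entries, titles, and dates are all hypothetical:

```python
from datetime import date

# Hypothetical mini-corpus; each article carries a release date.
corpus = [
    {"title": "Border holding facilities expand", "date": date(2019, 1, 15)},
    {"title": "Asylum ruling announced", "date": date(2019, 5, 2)},
]

def accessible_articles(corpus, question_date):
    """Keep only articles released strictly before the question's time frame,
    mirroring the constraint that models may see only past news."""
    return [a for a in corpus if a["date"] < question_date]

# A question about April 2019 can only use the January article.
past = accessible_articles(corpus, date(2019, 4, 1))
```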
Next, we will show how we created the ForecastQA dataset.
Data Collection. To obtain questions that require temporal reasoning, we should collect questions about future events. However, creating questions about the future is not straightforward; we never know what will happen. Instead, we make questions from past news articles. Our dataset is collected in three steps: (1) curating articles, (2) crowdsourcing question-answer pairs on those articles, and (3) collecting relevant news articles, manually annotated by crowd workers, for a subset of questions in order to analyze how helpful human-curated articles are.
Below we go deeper into these three stages: how we collect news articles (Section 3.1), how we make questions (Section 3.2), and how we find appropriate and relevant articles (Section 3.3).
3.1 Collecting News Articles
We gather news articles from LexisNexis (https://risk.lexisnexis.com/our-technology), a third-party service that collects news articles. After obtaining the articles, we find that the collection includes non-English articles and unreliable sources. Hence, we manually select 21 news sources and filter out articles published before 2015-01-01 or after 2019-11-30, leaving a total of 509,776 articles.
| Split | Train | Dev | Test | Total |
|---|---|---|---|---|
| Number of questions | 8,172 | 683 | 1,362 | 10,217 |
3.2 Making Questions via AMT
Next, we employed crowdsourcing via Amazon Mechanical Turk (https://www.mturk.com) to create questions. We randomly selected news articles ranging from Jan. 2019 to Nov. 2019 from our corpus. It is non-trivial to make questions about future events, since we do not know their answers. Instead, we asked crowd workers to create questions about the events in the news articles, and to add a time frame to each question to inject a temporal element. By doing this, our dataset tests a model’s temporal reasoning.
To diversify the questions, we create two kinds: yes-no questions and multiple-choice questions. The former expect an answer of either “yes” or “no”, while the latter expect one answer among several choices. Yes-no questions can be formed in positive or negative form, e.g., “Will primary schools admit non-vaccinated children?” versus “Won’t primary schools admit non-vaccinated children?” However, we avoid negative forms and use antonyms instead to make questions more natural. Such yes-no questions require the capability of judging the correctness of the question. Multiple-choice questions, on the other hand, start with the five Whs (who, what, when, where, and why); they are more challenging since they require judging the correctness of each choice.
Crowd workers were encouraged to make one yes question and one no question from each article, plus one multiple-choice question with four choices. To discourage low-effort submissions, workers were tasked with finding a sentence of evidence in the given article and making a question from that evidence. In addition, they were encouraged to ask questions in their own words, without copying phrases from the articles. We employed a rule-based filtering method to rule out undesirable questions. In total, we collected 5,704 yes-no questions and 4,513 multiple-choice questions. Table 3 shows the basic statistics of the questions. Figure 2 shows the date distribution of gold articles (the ones used to collect questions and answers); the dates range from 01-01-2019 to 11-30-2019.
3.3 Collecting Relevant Articles to Questions
Here, we collect relevant articles for a subset of questions in order to get a sense of how helpful human-curated articles are. To obtain manually annotated articles we again rely on crowdsourcing via Amazon Mechanical Turk. Crowd workers were given a set of articles and a question, and asked to find the articles relevant to the question. We used BM25 Qi et al. (2019) to obtain the candidate set of articles for each question. Workers were also encouraged to find a piece of evidence that provides an answer or a hint for the question. As a result, articles for 883 questions were annotated. Table 4 shows the distribution of articles; for example, 219 of the 883 questions have one relevant article. In the next section, we analyze our dataset.
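As an illustration of the candidate-retrieval step, the sketch below implements the standard Okapi BM25 scoring formula from scratch; the toy documents and query are invented, and the paper’s actual retriever follows Qi et al. (2019):

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N  # average document length
    df = Counter()                          # document frequency per term
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
            s += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            )
        scores.append(s)
    return scores

# Toy corpus and query (hypothetical).
docs = [
    "drone delivery network launched in ghana".split(),
    "stock markets close higher on friday".split(),
]
query = "drone delivery in africa".split()
ranked = sorted(range(len(docs)), key=lambda i: bm25_scores(query, docs)[i], reverse=True)
```

In practice one would keep the top-10 ranked articles per question, as in our open-domain setting.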
4 Data Analysis
We now look deeper into the final quality of our ForecastQA dataset, mainly looking into the: 1) types of questions, and 2) types of reasoning required to answer the questions.
4.1 Types of Questions
We analyze the question types by examining the first two words of each question. Figure 4 shows a sunburst plot of these prefixes; empty colored blocks indicate words that are too rare to show individually. As shown, nearly half of the questions are “will” questions, because most yes-no questions start with “will”. Since ForecastQA contains forecasting questions, some questions start with “In” followed by a month to specify a time frame.
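The prefix analysis behind the sunburst plot can be reproduced in a few lines; the sample questions here are invented for illustration:

```python
from collections import Counter

# Invented sample questions in the style of ForecastQA.
questions = [
    "Will primary schools admit non-vaccinated children by March 2019?",
    "Will the trade deal be signed in October 2019?",
    "In May 2019, what will happen to fuel prices?",
    "Who will win the election in November 2019?",
]

# Count lowercased two-word prefixes, the basis of the sunburst plot.
prefixes = Counter(" ".join(q.split()[:2]).lower() for q in questions)
```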
| Reasoning Type | Detailed Reasoning Type | % |
|---|---|---|
| Language Understanding | Lexical variations (synonymy, coreference) | 17.03 |
| | Syntactic variations (paraphrase) | 24.44 |
| Multi-hop Reasoning | Checking multiple properties | 3.33 |
| | Bridge entities | 1.85 |
| Numerical Reasoning | Addition, subtraction | 1.85 |
| | Comparison | 2.86 |
| Commonsense Reasoning | World knowledge | 13.33 |
| | Social commonsense | 2.59 |
| | Temporal commonsense | 3.33 |
| Temporal Reasoning | Resolving time information | 8.88 |
| | Causal relations | 6.29 |
| | Temporal relations | 2.22 |
| | Inferring based on past events | 11.85 |
4.2 Reasoning Types
To get a better understanding of the reasoning required to answer the questions, we sampled 100 questions and manually analyzed the reasoning types, which is summarized in Table 5. Figure 3 shows representative examples.
Language Understanding. We introduce lexical variations and syntactic variations following Rajpurkar et al. (2016, 2018). Lexical variations cover synonyms or coreferences between the question and the evidence sentence. When the question is paraphrased into another syntactic form and the evidence sentence matches that form, we call it a syntactic variation. We find that many questions require language understanding: lexical variations account for 17.03% and syntactic variations for 24.44%.
Multi-hop Reasoning. Some questions require multi-hop reasoning Yang et al. (2018), such as checking multiple properties (3.33%) and bridge entities (1.85%). The former requires finding multiple properties in an article to determine an answer. The latter requires identifying a bridge entity that connects two pieces of evidence, then finding the answer in the second hop.
Numerical Reasoning. To answer our questions, one needs numerical reasoning Dua et al. (2019). The answer is found by adding or subtracting two numbers (1.85%), or comparing two numbers (2.86%) in the given articles.
Commonsense Reasoning. The questions also require world knowledge Talmor et al. (2018), social commonsense Sap et al. (2019), and temporal commonsense Zhou et al. (2019). Apart from the given articles, we must exploit world knowledge or social interactions to answer the questions. We find that 13.33% of the questions need world knowledge and 2.59% of questions require social commonsense. The other type of commonsense reasoning is temporal commonsense which is related to temporal knowledge Zhou et al. (2019). 3.33% of questions are related to temporal commonsense.
Temporal Reasoning. To answer ForecastQA questions, one must leverage temporal information, since each event is tied to a certain timestamp and each question asks about a future event. We define sub-types of temporal reasoning: resolving time information for individual events mentioned in each document, causal relations between events, temporal relations, and inferring future events based on past events. As shown in Figure 3, some questions require resolving time information, e.g., ‘last year’ refers to 2018 in the example. Causal relations indicate cause and effect: one event is influenced by another event, process, or state. Temporal relations refer to precedence between two events. Inferring future events or states from past evidence is what we call “inferring based on past events.” In our analysis, questions require resolving time information (8.88%), causal relations (6.29%), temporal relations (2.22%), and inferring based on past events (11.85%).
| Methods / Metrics | Accuracy | Brier score |
|---|---|---|
| BERT (text encoder) | | |
| Title & content | 0.730 / 0.838 / 0.778 | 0.419 / 0.243 / 0.342 |
| Title & content | 0.722 / 0.852 / 0.779 | 0.377 / 0.235 / 0.314 |
5 Models for ForecastQA
To test the performance of leading QA systems on our data, we design models based on the pre-trained language models BERT Devlin et al. (2019) and RoBERTa Liu et al. (2019). Figure 5 shows the architecture of our method.
Formally, the model receives a question q and a set of articles A = {a_1, ..., a_n} as input. Following Devlin et al. (2019), each input sequence is formed as x_i = [CLS] q [SEP] a_i [SEP], where [CLS] is a special symbol added in front of every input example and [SEP] is a special separator token. Each x_i is fed into a language model:

h_i = LM(x_i),    (1)

where LM is BERT or RoBERTa and h_i ∈ R^d with hidden dimension d. The function LM produces one vector for each token, including the vector h_i^[CLS] corresponding to [CLS], which is used as a “pooled” representation of the sequence. Next, we use an Aggregate function over these pooled representations to make a prediction:

ŷ = σ(W · Aggregate(h_1^[CLS], ..., h_n^[CLS])),    (2)

where σ is a sigmoid function, W is a learnable weight, and n is the number of articles in the set A.
In the case of multiple-choice questions, we train the model on each choice separately and treat the task as binary classification. We concatenate the question, a choice, and an article: given a question q, a choice c, and an article a, the input is [CLS] q [SEP] c [SEP] a [SEP].
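A minimal end-to-end sketch of this prediction pipeline: a stub stands in for the BERT/RoBERTa encoder, and element-wise max-pooling (one of the aggregate functions discussed below) serves as Aggregate. All names, dimensions, and weights are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hidden dimension (stand-in for BERT's 768)

def encode(question, article):
    """Stub for LM([CLS] q [SEP] a [SEP]) -> pooled [CLS] vector.
    A real system would run BERT or RoBERTa here."""
    seed = abs(hash((question, article))) % (2**32)
    return np.random.default_rng(seed).normal(size=d)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(question, articles, W):
    # One pooled [CLS] vector per (question, article) pair ...
    H = np.stack([encode(question, a) for a in articles])
    # ... aggregated by element-wise max-pooling over the articles.
    pooled = H.max(axis=0)
    return sigmoid(W @ pooled)  # probability of "yes" (or of one choice)

W = rng.normal(size=d)  # untrained classifier weights, for illustration only
p = predict("Will X happen by May 2019?", ["article one", "article two"], W)
```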
Next, we introduce the various Aggregate functions we used.
Sequential Aggregate. The news articles form a sequence ordered by release date, and the idea of the sequential aggregate is to exploit this sequential nature. To do so, we use a Gated Recurrent Unit (GRU) Cho et al. (2014) as our Aggregate function: we first sort the news articles by release date and feed them into the GRU:

h = GRU(h_1^[CLS], ..., h_n^[CLS]).    (3)

We also introduce a time encoding so that the GRU is aware of dates. Dates are first converted to the UNIX time format, this time is embedded into a vector t_i, and the encoded vector is appended to the representation of the [CLS] token. Equation (3) then becomes h = GRU([h_1^[CLS]; t_1], ..., [h_n^[CLS]; t_n]).
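The sequential aggregate can be sketched with a hand-rolled GRU cell; the parameters below are random (untrained), the dimensions are illustrative, and a single scaled UNIX-time feature stands in for a learned time embedding:

```python
import numpy as np

rng = np.random.default_rng(1)
d_cls, d_hid = 8, 16        # pooled-[CLS] size and GRU state size (illustrative)
d_in = d_cls + 1            # [CLS] vector plus one appended time feature

# Random, untrained GRU parameters; a real model would learn these.
Wz, Wr, Wh = (rng.normal(scale=0.1, size=(d_hid, d_hid + d_in)) for _ in range(3))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru(inputs):
    """Run a minimal GRU over the article sequence and return the final state."""
    h = np.zeros(d_hid)
    for x in inputs:
        hx = np.concatenate([h, x])
        z = sigmoid(Wz @ hx)                                # update gate
        r = sigmoid(Wr @ hx)                                # reset gate
        h_tilde = np.tanh(Wh @ np.concatenate([r * h, x]))  # candidate state
        h = (1 - z) * h + z * h_tilde
    return h

# Articles as (pooled [CLS] vector, UNIX release time), sorted by date; the
# scaled timestamp is appended to each vector before entering the GRU.
articles = [(rng.normal(size=d_cls), 1556668800), (rng.normal(size=d_cls), 1546300800)]
articles.sort(key=lambda a: a[1])
seq = [np.concatenate([cls_vec, [ts / 1e9]]) for cls_vec, ts in articles]
h_final = gru(seq)
```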
Set Aggregate. Other ways of dealing with multiple articles are a max-pooling operation and a summarizer. Unlike the sequential aggregator, these do not take the order of articles into account. The max-pooling operator is an element-wise max over the vector representations of the [CLS] tokens, i.e., h = max(h_1^[CLS], ..., h_n^[CLS]). In addition, we adopt an MMR summarizer Carbonell and Goldstein (1998), an extractive multi-document summarizer, to condense the set of news articles into one short document. The summarized document is fed into a language model, and the transformed representation of its [CLS] token is used for prediction.
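The MMR selection rule (relevance to the query minus redundancy with already-selected sentences) can be sketched as follows; Jaccard word overlap is a stand-in similarity, and the sentences, λ = 0.7, and k are illustrative:

```python
def jaccard(a, b):
    """Word-overlap similarity used as a stand-in for a real relevance model."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def mmr_summarize(query, sentences, k=2, lam=0.7):
    """Greedy Maximal Marginal Relevance: repeatedly pick the sentence most
    relevant to the query, penalizing similarity to sentences already chosen."""
    selected, candidates = [], list(sentences)
    while candidates and len(selected) < k:
        best = max(
            candidates,
            key=lambda s: lam * jaccard(s, query)
            - (1 - lam) * max((jaccard(s, t) for t in selected), default=0.0),
        )
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy input: the second sentence is nearly a duplicate of the first,
# so MMR skips it in favor of the diverse (if less relevant) third one.
sentences = [
    "drones deliver medicine in ghana",
    "drones deliver medicine in ghana today",
    "markets fell sharply",
]
summary = mmr_summarize("medicine delivery ghana", sentences, k=2)
```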
6 Evaluation on ForecastQA
In this section, we describe experimental setups and experimental results.
6.1 Experimental Setup
Before diving into our main task, we first analyze the difficulty of our dataset given gold articles to show its validity. We then evaluate our method in an open-domain setting using BM25 Qi et al. (2019), to examine the difficulty of ForecastQA with retrieved articles. In this setting, models must answer the question given past news articles from which the answers cannot be directly inferred, because we are testing the model’s temporal reasoning ability: if answers appeared verbatim in the articles, the task would reduce to simple language understanding.
We also test our model with manually annotated articles. We retrieve 10 articles for the open-domain setting and use BERT-base and RoBERTa-base. The random-guessing baseline predicts classes uniformly at random.
Evaluation metrics. We adopt accuracy and the Brier score. The Brier score measures the accuracy of probabilistic predictions; it is the mean squared error of the predicted probabilities. Formally,

Brier = (1/N) Σ_{i=1}^{N} Σ_{c=1}^{C} (p_{i,c} − y_{i,c})²,

where p_{i,c} is the predicted probability that instance i belongs to class c, y_{i,c} ∈ {0, 1} is the corresponding label, N is the number of prediction instances, and C is the number of classes. The lower the Brier score, the better the performance.
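A direct implementation of this definition (the probabilities and labels below are made up):

```python
def brier_score(probs, labels):
    """Mean over instances of the squared error between the predicted
    class-probability vector and the one-hot label vector."""
    return sum(
        sum((p_c - y_c) ** 2 for p_c, y_c in zip(p, y))
        for p, y in zip(probs, labels)
    ) / len(probs)

# Two instances, two classes (made-up numbers).
probs = [[0.8, 0.2], [0.4, 0.6]]
labels = [[1.0, 0.0], [0.0, 1.0]]
score = brier_score(probs, labels)  # (0.08 + 0.32) / 2 = 0.20
```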
6.2 Performance Study and Analysis
In this section, we first analyze the validity of our dataset and discuss experimental results on an open-domain setting and on manually annotated articles.
Validity of Our Dataset. We test BERT and RoBERTa with different types of inputs. We use a fully connected layer with a sigmoid activation function on top of the vector corresponding to the [CLS] token. We vary the input: article titles (T), article contents (C), titles and contents (T,C), evidence (E), and questions only. Evidence is a sentence from a gold article; since the questions were made from the evidence, they can be answered by it. As shown in Table 6, RoBERTa with evidence exhibits the best performance, which indicates that the pre-trained models can perform well when given the evidence. Comparing BERT with titles (T) against BERT with contents (C) suggests that contents carry more useful information for answering questions than titles. Interestingly, BERT and RoBERTa with only questions perform decently, better than random guessing. This analysis suggests that our dataset is not an easy task and supports its validity. (We will report human performance in a future update.)
Experiments in an Open-domain Setting. Table 7 shows the results of the proposed approaches with a BM25 retriever. We compare the pre-trained language models with various aggregate functions. BERT with the max-pooling operator performs best; in particular, it outperforms BERT with a GRU, probably because the set of retrieved articles does not form a meaningful sequence, whereas max-pooling picks out important information from the set. We also notice that RoBERTa with GRU+time encoding makes no difference relative to RoBERTa with GRU, suggesting that more sophisticated modeling of time is required. The MMR summarizer helps compress the given articles, and BERT with summarization thus achieves performance similar to BERT+Maxpool.
In addition, we compare retrieval methods: BM25 Qi et al. (2019) and TF-IDF Chen et al. (2017). As shown in Table 8, the BM25 retriever performs better than the TF-IDF retriever for both BERT and RoBERTa.
Experiments with Manually Annotated Articles. Table 9 shows the results with manually annotated articles, which we collected for a subset of questions to measure how helpful curated articles are. As shown in Table 4, 82.3% of these questions have 4 or fewer curated articles, while the retrieval counterpart uses 10 articles. To avoid information loss, we therefore supplement the curated articles with additional retrieved ones so that each question has 10 articles in total. As shown, BERT+Maxpool with manually annotated articles slightly outperforms the same model with BM25-retrieved articles.
6.3 Error Analysis
We randomly select 50 errors made by our best approach on the test set and identify four phenomena:
Wrong Retrieved Articles. In 28% of the errors, wrong articles are retrieved and models cannot find supporting facts from articles. Our approach relies on information retrieval models such as BM25. Some retrieved articles do not have supporting facts for the questions, and thus our model predicts wrong answers. To resolve this error, we can adopt improved retrieval models.
Finding Wrong Evidence. 24% of the errors are due to finding wrong evidence. Even when appropriate articles are retrieved, the model may pick the wrong sentence as a supporting fact. For example, for the question “Who will be responsible for the Pentagon cloud lawsuit in November 2019?”, the model predicted Amazon based on the sentence “Amazon will deliver the pentagon to the cloud”, which is not supporting evidence for the question.
Lacking Human Common Sense. In 32% of the errors, the model selected the wrong choice because it lacks common sense or world knowledge. For example, to answer the question “Which European city will not be part of Jet Airways long haul suspended service cities by April 2019?”, the model should know which country each city belongs to; lacking that knowledge leads to a wrong answer.
Ambiguous Questions. 12% of the errors come from ambiguous questions. Some questions are unclear and can have multiple answers depending on the time. For example, the question “What will Pope Francis apologize for in September 2019?” might have multiple answers depending on when it is asked.
To test the temporal reasoning ability of current question answering approaches, we introduce ForecastQA, a forecasting question answering dataset built on news articles with crowdsourced question-answer pairs. The task is inherently challenging, as the questions require temporal reasoning and complex compositions of various types of reasoning, such as commonsense and multi-hop reasoning. Widely used baselines, including BERT and RoBERTa, do not achieve desirable performance. We believe our benchmark can benefit future research beyond natural language understanding and expect performance on it to improve significantly.
- The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 335–336.
- Reading Wikipedia to answer open-domain questions. In ACL.
- Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP.
- BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT.
- DROP: a reading comprehension benchmark requiring discrete reasoning over paragraphs. In NAACL-HLT.
- Cosmos QA: machine reading comprehension with contextual commonsense reasoning. arXiv preprint arXiv:1909.00277.
- TempQuestions: a benchmark for temporal question answering. In WWW.
- TEQUILA: temporal question answering over knowledge bases. arXiv preprint arXiv:1908.03650.
- The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics 6, pp. 317–328.
- RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
- Revealing the importance of semantic retrieval for machine reading at scale. In EMNLP-IJCNLP.
- Answering complex open-domain questions through iterative query generation. In EMNLP-IJCNLP.
- Know what you don’t know: unanswerable questions for SQuAD. In ACL.
- SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP.
- Social IQa: commonsense reasoning about social interactions. In EMNLP.
- Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603.
- Reading comprehension with graph-based temporal-casual reasoning. In COLING.
- The web as a knowledge-base for answering complex questions. In NAACL-HLT.
- CommonsenseQA: a question answering challenge targeting commonsense knowledge. In NAACL-HLT.
- NewsQA: a machine comprehension dataset. In Rep4NLP@ACL.
- Constructing datasets for multi-hop reading comprehension across documents. Transactions of the Association for Computational Linguistics 6, pp. 287–302.
- HotpotQA: a dataset for diverse, explainable multi-hop question answering. In EMNLP.
- “Going on a vacation” takes longer than “going for a walk”: a study of temporal commonsense understanding. In EMNLP-IJCNLP.