ForecastQA: Machine Comprehension of Temporal Text for Answering Forecasting Questions

05/02/2020 ∙ by Woojeong Jin, et al. ∙ University of Southern California

Textual data are often accompanied by time information (e.g., dates in news articles), but this information is easily overlooked in existing question answering datasets. In this paper, we introduce ForecastQA, a new open-domain question answering dataset consisting of 10k questions that require temporal reasoning. ForecastQA is collected via a crowdsourcing effort based on news articles, where workers were asked to come up with yes-no or multiple-choice questions. We also present baseline models for our dataset, which are based on pre-trained language models. In our study, our best baseline model achieves 61.6% accuracy on the ForecastQA dataset. We expect that our new data will support future research efforts. Our data and code are publicly available at




1 Introduction

Figure 1: Two kinds of forecasting questions in ForecastQA: a multiple-choice question and a yes-no question. The questions require temporal reasoning over the evidence in the retrieved articles. The bold choices are answers.
Dataset                                   Answer Type
ForecastQA (ours)                         Multiple choice, yes/no
SQuAD 2 Rajpurkar et al. (2018)           Spans
HotpotQA Yang et al. (2018)               Spans
NewsQA Trischler et al. (2016)            Spans
DROP Dua et al. (2019)                    Numbers, entities
NarrativeQA Kociský et al. (2018)         Free form
CosmosQA Huang et al. (2019)              Multiple choice
CommonsenseQA Talmor et al. (2018)        Multiple choice
MCTaco Zhou et al. (2019)                 Multiple choice
Table 1: Comparison of the ForecastQA dataset to other question answering datasets.

Machine reading comprehension has become a crucial task in natural language understanding. To test reasoning and inference over natural language, we turn to the task of question answering (QA). Recently, various QA datasets have been created to test single-hop reasoning Rajpurkar et al. (2016); Trischler et al. (2016) and multi-hop reasoning Yang et al. (2018); Welbl et al. (2017); Dua et al. (2019); Talmor and Berant (2018) abilities. These datasets have driven steady progress in natural language understanding and have spurred large improvements in model architectures Seo et al. (2016); Devlin et al. (2019); Nie et al. (2019); Liu et al. (2019).

However, existing QA datasets are limited in that they focus on testing a QA system’s ability to find or locate answers in single or multiple paragraphs. For example, HotpotQA Yang et al. (2018) questions are designed to be answered given a set of paragraphs as the context, and most questions can be answered by locating answers in multiple sentences. While this dataset does test a system’s ability to reason over multiple pieces of evidence, it does not examine a system’s temporal reasoning, i.e., the ability to leverage temporal information in articles.

Although textual data are often accompanied by time information (e.g., dates in news articles), this information is easily overlooked and is not the focus of existing datasets. For example, to answer the question “How long will Mexican asylum seekers be held in the US by April 2019?” in Figure 1, we need to figure out the temporal relations between retrieved articles, and subsequently the temporal trajectory of each relevant supporting fact. Finally, humans would be able to infer the answer by using the information from each article and the extracted temporal relations, as long as the question is indeed answerable given the information available.

In this work, we construct ForecastQA, a new dataset that assesses a model’s temporal reasoning ability: resolving time information, causal relations, temporal relations, and inferring based on past events. To answer questions in ForecastQA, a model must identify the time-related information in articles, as well as the temporal trends of evidence, while still demonstrating the natural language understanding required by earlier datasets. ForecastQA is collected via a crowdsourcing effort based on news articles, where workers are shown articles and asked to come up with yes-no or multiple-choice questions and find supporting evidence for the questions. As a result, crowd workers crafted 5,704 yes-no questions and 4,513 multiple-choice questions that can test a model’s temporal reasoning ability. We also provide manually annotated articles for a subset of questions to test how helpful the curated articles are.

In our experiments, we investigate the difficulty of ForecastQA given the gold articles from which the questions were made. We find that RoBERTa Liu et al. (2019) achieves 77.9% accuracy, suggesting ForecastQA is a difficult and valid dataset. We also design a method based on pre-trained language models to deal with retrieved articles in an open-domain setting. The method achieves 61.6% accuracy on the ForecastQA dataset with articles retrieved by a BM25 information retriever.

Q: What will help provide access to vital medicines in Africa by October 2019?
Choices: Infrastructure, New technologies (answer), Health workers, Basic medical care.
Article: Drones carrying medicines and blood to Ghana’s rural millions. (4/24/19)
The world’s largest drone delivery network has been launched in Ghana where it will have the capacity to dispense medicines and vaccines to millions of people even in the remotest corners of the west African country.
The service will enable staff at 2,000 health centres to receive deliveries via a parachute drop within half an hour of texting in their orders.
Reasoning Process: The world’s largest drone delivery network = one of new technologies (commonsense world knowledge). Ghana is in Africa (commonsense world knowledge). Dispense medicines and vaccines = provide access to vital medicines (paraphrase). They were delivered in April which will help them to provide vital medicines in October (temporal reasoning - inferring based on past events).
Table 2: Detailed example to show how to solve the question. The bold word in the choices is the answer. The question requires commonsense world knowledge, paraphrase, and temporal reasoning.

2 Related Work

In this section, we review related work on question answering datasets. We divide question answering datasets into three categories: (1) extractive question answering, (2) temporal question answering, and (3) commonsense question answering. We summarize the key features of a collection of recent datasets in Table 1.

Extractive Question Answering. Recently, many extractive question answering datasets have been produced, which ask single-hop questions Rajpurkar et al. (2016, 2018); Trischler et al. (2016), and multi-hop questions Yang et al. (2018). Although SQuAD Rajpurkar et al. (2016, 2018) and HotpotQA Yang et al. (2018) both build question answering datasets from Wikipedia articles, their main difference lies in the number of passages needed to answer the question. Specifically, SQuAD finds the answer in one paragraph while HotpotQA requires multiple pieces of evidence from more than one paragraph to extract the answer to the question. However, both provide the answer to the question by locating a span of text in the passage. Similarly, NewsQA Trischler et al. (2016) uses the same extraction method to answer the question, but this dataset is based on CNN news articles. Compared to these tasks, our ForecastQA dataset requires temporal reasoning ability, i.e., reasoning over the historical evidence from multiple news articles. Also, our answer type is yes/no or one of the given multiple choices.

Temporal Question Answering. There have been attempts to build temporal question answering datasets Jia et al. (2018a, b); Sun et al. (2018) that focus on the analysis of temporal intent. Temporal intent is defined as the detection of key context words, such as “before”, which depict time dependency between events, also known as temporal relations. It can also identify cause and effect events, i.e., causal relations. Some examples of datasets which utilize temporal intent are TempQuestions Jia et al. (2018a) and TEQUILA Jia et al. (2018b). Both decompose each of their questions into two sub-questions to investigate temporal relations. In addition, Sun et al. (2018) propose event graphs built from the given text and use them to figure out which event happened first. The aforementioned datasets concentrate only on analyzing questions that contain temporal relations or causal relations. In contrast, our ForecastQA requires more complex reasoning in addition to temporal relations or causal relations.

Commonsense Question Answering. Most QA systems focus on factoid questions while CommonsenseQA Talmor et al. (2018) asks about common sense and is designed to map the answer to each question according to common sense knowledge without additional information. Likewise, SocialIQA Sap et al. (2019) is a commonsense dataset about social situations. This dataset reasons over social and emotional interactions to apply human commonsense; it does this by reasoning about motivation, what happens next, and emotional reactions. On the other hand, our ForecastQA dataset uses supporting evidence from relevant articles and might need common sense to answer questions. Furthermore, MCTaco Zhou et al. (2019) also deals with temporal questions that require temporal commonsense reasoning. The major difference between it and ours is that questions in MCTaco are answerable by common sense, whereas ours require complex reasoning including commonsense.

3 The ForecastQA Dataset

Now we introduce our ForecastQA task and data collection. The goal of the ForecastQA task is to test temporal reasoning ability, which distinguishes it from existing QA datasets.

ForecastQA Task. The input of our task is a forecasting question $Q$ with a timestamp $t_Q$ and a news article corpus $\mathcal{A}$, while the output is yes/no or one of the given choices. The distinction of our ForecastQA dataset is that models have access only to the articles $A \in \mathcal{A}$ such that $t_A < t_Q$, where $t_A$ is the release date of article $A$, i.e., only past news articles are accessible. For example, given a question “How long will Mexican asylum seekers be held in the US by April 2019?” in Figure 1, models can access articles that were released before April 2019. This setup makes our dataset more challenging and distinct from existing QA datasets.
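
The access restriction described above can be illustrated with a minimal sketch. The dictionary fields (`title`, `date`), the article titles, and the choice of April 1, 2019 as the cutoff for a question about April 2019 are assumptions made for illustration; they are not taken from the released data format.

```python
from datetime import date

def accessible_articles(corpus, question_date):
    """Keep only articles released strictly before the question's date."""
    return [a for a in corpus if a["date"] < question_date]

# Hypothetical mini-corpus; titles and dates are placeholders.
corpus = [
    {"title": "Article A", "date": date(2019, 2, 11)},
    {"title": "Article B", "date": date(2019, 3, 30)},
    {"title": "Article C", "date": date(2019, 4, 24)},
]

# Assumed cutoff for a question that asks about April 2019.
past = accessible_articles(corpus, date(2019, 4, 1))
print([a["title"] for a in past])
```

Under this filter, an article released on or after the cutoff never reaches the model, so the answer cannot simply be read off a later report.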

Next, we will show how we created the ForecastQA dataset.

Data Collection. To get questions that require temporal reasoning, we should collect questions about future events. However, creating questions about the future is not straightforward; we never know what will happen in the future. Instead, we make questions from past news articles. Our question answering dataset is collected in the following three steps: (1) curating articles, (2) crowdsourcing question-answer pairs on those articles, and (3) collecting relevant news articles for a subset of questions, manually annotated by crowd workers, to analyze how helpful human-curated articles are.

Below we go deeper into the three steps mentioned above: how we collect news articles (Section 3.1), how we make questions (Section 3.2), and how we find appropriate and relevant articles (Section 3.3).

3.1 Collecting News Articles

We gather news articles from LexisNexis, a third-party service that collects news articles. After obtaining news articles, we find that the collected news articles include non-English articles and unreliable sources. Hence, we manually select 21 news sources and filter out articles before 2015-01-01 and after 2019-11-30, and finally get a total of 509,776 articles.

Statistics Train Dev Test All
Number of questions 8,172 683 1,362 10,217
Yes-no questions 4,562 382 760 5,704
Multi-choice questions 3,610 301 602 4,513
Average length 14.45 14.73 14.47 14.47
Table 3: The basic statistics of ForecastQA data. The average length refers to the average number of tokens in the questions.

3.2 Making Questions via AMT

Next, we employed crowdsourcing to create questions via Amazon Mechanical Turk. We randomly selected news articles ranging from Jan. 2019 to Nov. 2019 in our news article corpus to create questions. It is non-trivial to make questions about future events since we do not know the answers to such questions. Instead, we asked crowd workers to simply create questions about the events in the news articles. They were also asked to add a time frame to each question to inject a temporal element. By doing this, our question answering dataset tests a model’s temporal reasoning ability.

Figure 2: Date distribution of gold articles for questions. Each question is made from gold articles. The dates denote release dates of news articles and they range from 01-01-2019 to 11-30-2019.
# Articles 1 2 3 4 5 6 7 8 9 10 All
Yes/no 64 83 96 158 70 34 12 2 2 5 526
Multichoice 155 93 46 32 18 6 2 1 1 3 357
Questions 219 176 142 190 88 40 14 3 3 8 883
Table 4: Distribution of manually annotated articles. Yes/no refers to yes-no questions and multichoice refers to multi-choice questions. We collect articles for 883 questions, a subset of our data: 526 yes-no questions and 357 multi-choice questions.

To diversify questions, we create two kinds of questions: yes-no questions and multiple-choice questions. The former are questions whose expected answer is either “yes” or “no”, while the latter are questions whose expected answer is one of the given choices. Yes-no questions can be posed in positive and negative forms, e.g., “Will primary schools admit non-vaccinated children?” and “Won’t primary schools admit non-vaccinated children?” However, we avoid negative forms and use antonyms instead to make questions more natural. Such yes-no questions require judging whether the statement posed by the question is correct. On the other hand, multiple-choice questions start with one of the five Ws (i.e., who, what, when, where, and why). Multi-choice questions are more challenging since they require judging the correctness of each choice.

Figure 3: Examples of each type of reasoning in ForecastQA. Words relevant to the corresponding reasoning type are bolded.

Crowd workers were encouraged to make one yes question and one no question from each article, as well as one multi-choice question with four choices. To discourage low-effort submissions, workers were tasked with finding a sentence or piece of evidence in the given article and making a question from that evidence. In addition, they were encouraged to ask questions in their own words, without copying phrases from the articles. We employed a rule-based filtering method to rule out undesirable questions. As a result, we collected 5,704 yes-no questions and 4,513 multi-choice questions. Table 3 shows the basic statistics of the questions. Figure 2 shows the date distribution of gold articles (the ones used to collect questions and answers). The dates range from 01-01-2019 to 11-30-2019.

3.3 Collecting Relevant Articles to Questions

Here, we collect relevant articles for a subset of questions in order to get a sense of how helpful human-curated articles are. To get manually annotated articles, we rely on crowdsourcing via Amazon Mechanical Turk. Crowd workers were given a set of articles and a question, and asked to find the articles relevant to the question. We used BM25 Qi et al. (2019) to get the set of articles for each question. Workers were also encouraged to find a piece of evidence that gives an answer or a hint for the question. As a result, articles for 883 questions were annotated by workers. Table 4 shows the distribution of articles. For example, 219 out of 883 questions have one relevant article. In the next section, we analyze our dataset.
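
To make the candidate-retrieval step concrete, the following is a minimal sketch of Okapi BM25 scoring in plain Python. It is not the retriever used in the paper: the whitespace tokenization, the parameter values `k1=1.5` and `b=0.75`, and the three toy documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def retrieve(query, docs, top_k=10):
    """Return indices of the top_k highest-scoring documents."""
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]

# Toy documents (placeholders, not taken from the corpus).
docs = [
    "drones deliver medicines in ghana".split(),
    "stock markets fall in europe".split(),
    "ghana launches drone delivery network for vaccines".split(),
]
query = "medicines ghana".split()
print(retrieve(query, docs, top_k=2))
```

The first document matches both query terms and ranks first; the third matches only one term and ranks second; the second document shares no query term and scores zero.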

4 Data Analysis

Figure 4: Distribution of the first (the inner circle) and second words (the outer circle) in questions. Empty colored blocks indicate words that are too rare to show individually.

We now look deeper into the final quality of our ForecastQA dataset, focusing on (1) the types of questions and (2) the types of reasoning required to answer the questions.

4.1 Types of Questions

We analyze the question types by looking at the first two words of each question. Figure 4 shows a sunburst plot of the first two words in questions. Empty colored blocks indicate words that are too rare to show individually. As shown, nearly half of the questions start with will, which is because most yes-no questions start with will. Since ForecastQA contains forecasting questions, some questions start with In followed by a month to specify a time frame.
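
The first-two-word analysis can be sketched as a simple count. This is an illustrative reimplementation rather than the script behind Figure 4; lowercasing and whitespace tokenization are assumptions, and the four input questions are the examples quoted elsewhere in this paper.

```python
from collections import Counter

def first_two_words(questions):
    """Count first words (inner ring) and (first, second) word pairs (outer ring)."""
    first, pairs = Counter(), Counter()
    for q in questions:
        tokens = q.lower().split()
        if len(tokens) < 2:
            continue  # skip questions too short to have a second word
        first[tokens[0]] += 1
        pairs[(tokens[0], tokens[1])] += 1
    return first, pairs

questions = [
    "Will primary schools admit non-vaccinated children?",
    "What will help provide access to vital medicines in Africa by October 2019?",
    "Who will be responsible for the Pentagon cloud lawsuit in November 2019?",
    "What will Pope Francis apologize for in September 2019?",
]
first, pairs = first_two_words(questions)
print(first.most_common())
print(pairs.most_common())
```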

Reasoning Type Detailed Reasoning Type %
Language Understanding Lexical variations (synonymy, coreference) 17.03
Syntactic variations (paraphrase) 24.44
Multi-hop Reasoning Checking multiple properties 3.33
Bridge entity 1.85
Numerical Reasoning Addition, Subtraction 1.85
Comparison 2.96
Commonsense Reasoning World knowledge 13.33
Social commonsense 2.59
Temporal commonsense 3.33
Temporal Reasoning Resolving time information 8.88
Causal relations 6.29
Temporal relations 2.22
Inferring based on past events 11.85
Table 5: Types of reasoning required to answer questions in the ForecastQA dataset. 100 questions are manually analyzed. On average, 2.7 reasoning types are required for each question.

4.2 Reasoning Types

To get a better understanding of the reasoning required to answer the questions, we sampled 100 questions and manually analyzed the reasoning types, which is summarized in Table 5. Figure 3 shows representative examples.

Language Understanding. We introduce lexical variations and syntactic variations following Rajpurkar et al. (2016, 2018). Lexical variations represent synonyms or coreferences between the question and the evidence sentence. When the question is paraphrased into another syntactic form and the evidence sentence is matched to that form, we call it a syntactic variation. We find that many questions require language understanding; lexical variations account for 17.03% and syntactic variations account for 24.44%.

Multi-hop Reasoning. Some questions require multi-hop reasoning Yang et al. (2018), such as checking multiple properties (3.33%) and bridge entities (1.85%). The former requires finding multiple properties in an article to find an answer. The latter requires identifying a bridge entity that connects two entities and then finding the answer in the second hop.

Numerical Reasoning. Some questions require numerical reasoning Dua et al. (2019). The answer is found by adding or subtracting two numbers (1.85%), or comparing two numbers (2.96%) in the given articles.

Commonsense Reasoning. The questions also require world knowledge Talmor et al. (2018), social commonsense Sap et al. (2019), and temporal commonsense Zhou et al. (2019). Apart from the given articles, we must exploit world knowledge or social interactions to answer the questions. We find that 13.33% of the questions need world knowledge and 2.59% of questions require social commonsense. The other type of commonsense reasoning is temporal commonsense which is related to temporal knowledge Zhou et al. (2019). 3.33% of questions are related to temporal commonsense.

Temporal Reasoning. To answer ForecastQA questions, one must leverage temporal information since each event is related to a certain time stamp and each question asks about the future event. So we define sub-types of temporal reasoning: resolving time information for individual events mentioned in each document, causal relations between events, temporal relations, and inferring future events based on past events. As shown in Figure 3, a question requires resolving time information to answer the question, e.g., ‘last year’ refers to 2018 in the example. Causal relations indicate cause and effect; one event is influenced by another event, process, or state. In addition, temporal relations refer to precedence between two events. Based on past evidence or events, we can infer the future events or states, which is called “inferring based on past events.” In our analysis, questions require resolving time information (8.88%), causal relations (6.29%), temporal relations (2.22%), and inferring based on past events (11.85%).

Methods / Metrics Accuracy Brier score
yes/no multi all yes/no multi all
Random guessing 0.478 0.238 0.372 0.705 0.819 0.755
BERT (text encoder)
   Question 0.650 0.423 0.549 0.504 0.679 0.581
   Title 0.659 0.636 0.649 0.455 0.470 0.461
   Article content 0.701 0.808 0.748 0.437 0.283 0.369
   Title & content 0.730 0.838 0.778 0.419 0.243 0.342
   Gold evidence 0.775 0.868 0.816 0.383 0.199 0.302
RoBERTa (text encoder)
   Question 0.650 0.425 0.550 0.473 0.686 0.567
   Title 0.700 0.616 0.662 0.406 0.468 0.433
   Article content 0.702 0.840 0.763 0.389 0.279 0.341
   Title & content 0.722 0.852 0.779 0.377 0.235 0.314
   Gold evidence 0.806 0.892 0.844 0.326 0.177 0.260
Table 6: Results on gold articles on the ForecastQA test set. We give different inputs to the BERT and RoBERTa to find out which part is important for the questions.
Figure 5: Our model architecture. The aggregator collects the information from each article. For multi-choice questions, we train the model with each choice separately and treat them as a binary classification task.

5 Models for ForecastQA

To test the performance of leading QA systems on our data, we model our design based on the pre-trained language models, BERT Devlin et al. (2019) and RoBERTa Liu et al. (2019). Figure 5 shows the architecture of our method.

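Formally, the model receives a question $Q$ and a set of articles $\mathcal{A} = \{A_1, \dots, A_n\}$ as input. Following Devlin et al. (2019), each article is paired with the question to form an input $x_i = \text{[CLS]}\; Q\; \text{[SEP]}\; A_i\; \text{[SEP]}$, where [CLS] is a special symbol added in front of every input example, and [SEP] is a special separator token to separate questions and answers. Each $x_i$ is fed into $f$:

$$h_i = f(x_i)_{\text{[CLS]}} \in \mathbb{R}^{d}, \qquad (1)$$

where $f$ is BERT or RoBERTa and $d$ is the hidden dimension. The function $f$ produces one vector for each token, including the vector corresponding to [CLS], which is used as a “pooled” representation of the sequence. Next, we use an Aggregate function and add a 2-way classifier to get the probability of yes:

$$p(\text{yes} \mid Q, \mathcal{A}) = \sigma\big(\mathbf{w}^{\top}\, \text{Aggregate}(h_1, \dots, h_n) + b\big), \qquad (2)$$

where $\sigma$ is a sigmoid function, $\mathbf{w} \in \mathbb{R}^{d}$, and $\mathcal{A}$ is a set of articles.

In the case of multiple choice questions, we train the model with each choice separately and treat them as a binary classification task. We concatenate the question, a choice, and an article, and use them as an input. Given a question $Q$, a choice $C$, and an article $A_i$, the input will be $\text{[CLS]}\; Q\; \text{[SEP]}\; C\; \text{[SEP]}\; A_i\; \text{[SEP]}$.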
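
As an illustration of the per-choice binary formulation, the sketch below builds one input string per choice and then selects the choice with the highest yes-probability. The exact placement of separator tokens and the argmax selection rule at inference time are assumptions, and the probabilities passed in are hand-set placeholders standing in for a trained model's outputs, not real results. The question, choices, and article sentence come from Table 2.

```python
def build_inputs(question, choices, article):
    """One '[CLS] question [SEP] choice [SEP] article [SEP]' string per choice."""
    return [f"[CLS] {question} [SEP] {c} [SEP] {article} [SEP]" for c in choices]

def predict_choice(choices, yes_probs):
    """Pick the choice whose binary 'yes' probability is highest (assumed rule)."""
    best = max(range(len(choices)), key=lambda i: yes_probs[i])
    return choices[best]

question = "What will help provide access to vital medicines in Africa by October 2019?"
choices = ["Infrastructure", "New technologies", "Health workers", "Basic medical care"]
article = "The world's largest drone delivery network has been launched in Ghana."

inputs = build_inputs(question, choices, article)
print(len(inputs))

# Placeholder probabilities (hypothetical, not model outputs).
placeholder_probs = [0.10, 0.70, 0.15, 0.20]
print(predict_choice(choices, placeholder_probs))
```

Scoring each choice independently lets the same binary classifier serve both yes-no and multi-choice questions, at the cost of one forward pass per choice.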

Next, we introduce the various Aggregate functions we used.

Sequential Aggregate.

The news articles are represented as a sequence by release dates. The idea of the sequential aggregate is to utilize the sequential nature of news articles. To do this, we introduce a Gated Recurrent Unit (GRU) Cho et al. (2014) as our Aggregate function. We first sort the news articles by release dates, and feed them into a GRU as follows:

$$\text{Aggregate}(h_1, \dots, h_n) = \text{GRU}(h_1, \dots, h_n), \qquad (3)$$

where $h_i$ is the [CLS] representation of the $i$-th article in date order.

We also introduce the time encoding to the GRU so that the GRU is aware of dates. Dates are first converted to the UNIX time format and we embed this time into a vector $\tau_i$, and the encoded vector is appended to the representation of the [CLS] token. Then, the equation (3) becomes $\text{Aggregate}(h_1, \dots, h_n) = \text{GRU}([h_1; \tau_1], \dots, [h_n; \tau_n])$.
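
The preprocessing for the sequential aggregate (date ordering and UNIX-time conversion) can be sketched as follows. This covers only the input preparation, not the GRU itself. The `YYYY-MM-DD` date format, the field names `date` and `cls`, and the use of a single scaled scalar in place of a learned time embedding are simplifying assumptions.

```python
from datetime import datetime, timezone

def to_unix(date_str):
    """Convert a 'YYYY-MM-DD' release date to UNIX seconds (UTC midnight)."""
    dt = datetime.strptime(date_str, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp())

def with_time_feature(articles):
    """Sort articles by release date and append a scaled time value to each vector."""
    ordered = sorted(articles, key=lambda a: to_unix(a["date"]))
    # Assumption: one scalar (seconds / 1e9) stands in for the embedded time vector.
    return [a["cls"] + [to_unix(a["date"]) / 1e9] for a in ordered]

# Toy [CLS] vectors (placeholders) with release dates out of order.
articles = [
    {"date": "2019-04-24", "cls": [0.9, 0.1]},
    {"date": "2019-01-01", "cls": [0.2, -1.0]},
]
sequence = with_time_feature(articles)
print(sequence)
```

The resulting list is what a recurrent aggregator would consume step by step, oldest article first.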

Set Aggregate. Other ways of dealing with multiple articles are via a max-pooling operation or a summarizer. Different from the sequential aggregator, these do not take the order of articles into consideration. The max-pooling operator is defined as an element-wise max operation over the vector representations of [CLS] tokens, e.g., $\text{Aggregate}(h_1, \dots, h_n) = \max(h_1, \dots, h_n)$, where $h_i$ is the [CLS] representation of the $i$-th article and the max is taken per dimension.
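
A small sketch of the max-pooling aggregate followed by a sigmoid classifier is given below. The three-dimensional vectors, the weights, and the bias are toy placeholders rather than learned parameters, and plain Python lists are used in place of tensors.

```python
import math

def max_pool(vectors):
    """Element-wise maximum over a list of equal-length vectors."""
    return [max(column) for column in zip(*vectors)]

def yes_probability(vectors, weights, bias):
    """Sigmoid of a linear layer applied to the max-pooled representation."""
    pooled = max_pool(vectors)
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

# Toy [CLS] vectors for three retrieved articles (placeholders).
cls_vectors = [[0.2, -1.0, 0.5], [0.9, 0.1, -0.3], [-0.4, 0.3, 0.0]]
print(max_pool(cls_vectors))  # [0.9, 0.3, 0.5]
print(yes_probability(cls_vectors, weights=[0.0, 0.0, 0.0], bias=0.0))  # 0.5
```

Because the maximum is taken per dimension, the result is the same for any ordering of the articles, which is the property that separates this aggregate from the sequential one.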

In addition, we adopt an MMR summarizer Carbonell and Goldstein (1998), an extractive and multi-document summarizer to summarize the set of news articles into one short document. The summarized document is fed into a language model, and the transformed representation of the [CLS] token is used for prediction.
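
The greedy selection behind MMR can be sketched as follows. This is an illustration of the general technique rather than the summarizer configuration used in the paper: Jaccard overlap of token sets as the similarity measure, the trade-off value `lam=0.5`, and the three toy sentences are all assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity between two token collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def mmr_select(query, sentences, k=2, lam=0.5):
    """Greedy MMR: balance relevance to the query against redundancy with earlier picks."""
    selected = []
    candidates = list(range(len(sentences)))
    while candidates and len(selected) < k:
        def mmr(i):
            relevance = jaccard(query, sentences[i])
            redundancy = max(
                (jaccard(sentences[i], sentences[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy sentences (placeholders); the second is a near-duplicate of the first.
sentences = [
    "drone network delivers medicines in ghana".split(),
    "drone network delivers medicines in ghana today".split(),
    "health centres in ghana receive vaccines".split(),
]
query = "drone medicines ghana".split()
print(mmr_select(query, sentences, k=2))           # [0, 2]
print(mmr_select(query, sentences, k=2, lam=1.0))  # [0, 1]
```

With `lam=0.5` the near-duplicate second sentence is skipped in favor of a less redundant one, while `lam=1.0` reduces to ranking by relevance alone.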

6 Evaluation on ForecastQA

In this section, we describe experimental setups and experimental results.

6.1 Experimental Setup

Before diving into our main task, we first analyze the difficulty of our dataset given gold articles to show the validity of our dataset. Then, we evaluate our method in an open-domain setting using BM25 Qi et al. (2019) to examine the difficulty of the ForecastQA dataset with retrieved articles. In this setting, models should answer the question given only past news articles, from which the answer cannot be read off directly. This is because we are testing the temporal reasoning ability of the model. If answers were already in the articles, the task would become a simple language understanding task.

We also test our model with manually annotated articles. We retrieve 10 articles for the first setting and use BERT-base and RoBERTa-base. Random guessing is a random predictor which is defined by the uniform distribution.

Evaluation metrics. To evaluate performances, we adopt accuracy and Brier scores. The Brier score measures the accuracy of probabilistic prediction; it is the mean squared error of the probabilities. The formal definition of the Brier score is:

$$\text{Brier} = \frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} (p_{ic} - y_{ic})^2,$$

where $p_{ic}$ is the probability of prediction, $y_{ic}$ is a label of the instance, $N$ is the number of prediction instances, and $C$ is the number of classes. The lower the Brier score is, the better the performance is.
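
A minimal sketch of the metric follows, assuming the standard multi-class form of the Brier score: the squared error between the predicted class probabilities and a one-hot label, summed over classes and averaged over instances. The function name and the toy inputs are illustrative.

```python
def brier_score(probs, labels):
    """Mean over instances of the summed squared error against one-hot labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += sum((p_c - (1.0 if c == y else 0.0)) ** 2 for c, p_c in enumerate(p))
    return total / len(probs)

# A perfectly confident, correct prediction scores 0.
print(brier_score([[1.0, 0.0]], [0]))     # 0.0
# A uniform prediction over two classes scores 0.5.
print(brier_score([[0.5, 0.5]], [0]))     # 0.5
# A uniform prediction over four classes scores 0.75.
print(brier_score([[0.25] * 4], [2]))     # 0.75
```

Unlike accuracy, this score penalizes confident wrong predictions more heavily than hesitant ones, which is why the two metrics are reported side by side.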

6.2 Performance Study and Analysis

In this section, we first analyze the validity of our dataset and discuss experimental results on an open-domain setting and on manually annotated articles.

Analysis of the Difficulty of ForecastQA. Now we examine our dataset by testing it with pre-trained models, BERT Devlin et al. (2019) and RoBERTa Liu et al. (2019), with different types of inputs. We use a fully connected layer with a sigmoid activation function on top of the vector corresponding to the [CLS] token. We vary the input: article titles (T), article contents (C), titles and contents (T,C), evidence (E), and only questions. Evidence is a sentence from gold articles, and questions are made from evidence; the questions can be answered by evidence. As shown in Table 6, RoBERTa with evidence performs best among the variants (84.4% accuracy overall), which indicates the pre-trained models can achieve good performances given evidence. Comparing BERT with titles (T) against BERT with contents (C) suggests that contents carry more useful information for answering the question than titles. Interestingly, BERT and RoBERTa with only questions show decent performances, better than random guessing (about 55% versus 37.2% accuracy overall). This analysis suggests that our dataset is not an easy task and supports the validity of our dataset. (Footnote 3: We will update human performances in a future update.)

Methods / Metrics Accuracy Brier score
yes/no multi all yes/no multi all
BERT
   +GRU 0.502 0.355 0.436 0.565 0.751 0.647
   +GRU+Time encoding 0.501 0.370 0.442 0.567 0.751 0.649
   +Maxpool 0.630 0.598 0.616 0.550 0.523 0.538
   +Summarization 0.623 0.488 0.563 0.588 0.631 0.607
RoBERTa
   +GRU 0.501 0.327 0.423 0.583 0.751 0.657
   +GRU+Time encoding 0.501 0.330 0.424 0.582 0.749 0.656
   +Maxpool 0.515 0.549 0.530 0.505 0.557 0.528
   +Summarization 0.628 0.443 0.546 0.502 0.722 0.599
Random Guessing 0.478 0.238 0.372 0.705 0.819 0.755
Table 7: Results on retrieved articles on the ForecastQA test set. Yes/no refers to yes-no questions and multi refers to multi-choice questions. Given the retrieved articles by BM25, we also assess the various models based on pre-trained language models, BERT and RoBERTa. We add additional aggregator to the language models as discussed in Section 5.
Methods / Metrics Accuracy
yes/no multi all
BERT
   +BM25 0.630 0.598 0.616
   +TF-IDF 0.622 0.441 0.542
RoBERTa
   +BM25 0.515 0.549 0.530
   +TF-IDF 0.509 0.330 0.430
Table 8: Results on different retrieval models, BM25 and TF-IDF.
Methods / Metrics Accuracy
yes/no multi all
Annotated articles 0.631 0.677 0.650
BM25 0.625 0.672 0.644
Table 9: Results on annotated articles. We compare results with the annotated articles and retrieved articles with BM25, and use a BERT+Maxpool method. We manually curate articles for a subset of questions and test the performances.

Experiments on an Open-domain Setting. Table 7 shows the results of the proposed approaches with a BM25 retriever. We compare the pre-trained language models with various aggregate functions. We find that BERT with the maxpool operator shows the best performance (61.6% accuracy overall). It also outperforms BERT with GRU, probably because the set of articles may not form a meaningful sequence. Instead, the maxpool operator picks important information from the set of articles. We notice that RoBERTa with GRU+Time encoding performs almost the same as RoBERTa with GRU, suggesting that more sophisticated modeling of time is required. The MMR summarizer helps compress the given articles, and BERT with summarization comes close to BERT+Maxpool on yes-no questions (62.3% versus 63.0%), although it lags on multi-choice questions (48.8% versus 59.8%).

In addition, we test two retrieval methods: BM25 Qi et al. (2019) and TF-IDF Chen et al. (2017). As shown in Table 8, the BM25 retriever performs better than the TF-IDF retriever for both BERT and RoBERTa.

Experiments with Manually Annotated Articles. Table 9 shows the results on manually annotated articles. We annotated articles for a subset of questions to show how helpful the curated articles are. 82.3% of questions have 4 or fewer curated articles as shown in Table 4, while we use 10 retrieved articles for the counterpart. To avoid information loss, we also add retrieved articles to the curated ones, so that each question has 10 articles including the annotated articles. As shown, BERT+Maxpool with manually annotated articles slightly outperforms BERT+Maxpool with BM25-retrieved articles (65.0% versus 64.4% accuracy overall).

6.3 Error Analysis

We randomly select 50 errors made by our best approach from the test set, and identify 4 phenomena:

Wrong Retrieved Articles. In 28% of the errors, wrong articles are retrieved and models cannot find supporting facts from articles. Our approach relies on information retrieval models such as BM25. Some retrieved articles do not have supporting facts for the questions, and thus our model predicts wrong answers. To resolve this error, we can adopt improved retrieval models.

Finding Wrong Evidence. 24% of the errors are due to finding wrong evidence. Even though appropriate articles are retrieved, the model picks a wrong sentence as a supporting fact. For example, there is a question, “Who will be responsible for the Pentagon cloud lawsuit in November 2019?” The model predicted an answer, Amazon, from a sentence “Amazon will deliver the pentagon to the cloud.” However, this is not supporting evidence for the question.

Lacking Human Common Sense. In 32% of the errors, the model selected the wrong choice since it lacks common sense or world knowledge. For example, we have a question, “Which European city will not be part of Jet Airways long haul suspended service cities by April 2019?”. To answer this question, the model should know the country of the city. Lack of the knowledge causes the wrong answer.

Ambiguous Questions. 12% of the errors come from ambiguous questions. Some questions are not clear and can have multiple answers depending on the time. For example, the question “What will Pope Francis apologize for in September 2019?” might have multiple answers depending on the time.

7 Conclusion

To test the temporal reasoning ability of current question answering approaches, we introduce ForecastQA, a forecasting question answering dataset on news articles with crowdsourced question-answer pairs. The task is inherently challenging as the questions require temporal reasoning and complex compositions of various types of reasoning, such as commonsense and multi-hop reasoning. Widely used baseline methods, including BERT and RoBERTa, do not show desirable performances. We believe our benchmark dataset can benefit future research beyond natural language understanding and expect that performance will be significantly improved.


  • J. Carbonell and J. Goldstein (1998) The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 335–336.
  • D. Chen, A. Fisch, J. Weston, and A. Bordes (2017) Reading Wikipedia to answer open-domain questions. In ACL.
  • K. Cho, B. van Merrienboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT.
  • D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner (2019) DROP: a reading comprehension benchmark requiring discrete reasoning over paragraphs. In NAACL-HLT.
  • L. Huang, R. L. Bras, C. Bhagavatula, and Y. Choi (2019) Cosmos QA: machine reading comprehension with contextual commonsense reasoning. arXiv preprint arXiv:1909.00277.
  • Z. Jia, A. Abujabal, R. S. Roy, J. Strötgen, and G. Weikum (2018a) TempQuestions: a benchmark for temporal question answering. In WWW.
  • Z. Jia, A. Abujabal, R. S. Roy, J. Strötgen, and G. Weikum (2018b) TEQUILA: temporal question answering over knowledge bases. arXiv preprint arXiv:1908.03650.
  • T. Kociský, J. Schwarz, P. Blunsom, C. Dyer, K. M. Hermann, G. Melis, and E. Grefenstette (2018) The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics 6, pp. 317–328.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
  • Y. Nie, S. Wang, and M. Bansal (2019) Revealing the importance of semantic retrieval for machine reading at scale. In EMNLP/IJCNLP.
  • P. Qi, X. Lin, L. Mehr, Z. Wang, and C. D. Manning (2019) Answering complex open-domain questions through iterative query generation. In EMNLP/IJCNLP.
  • P. Rajpurkar, R. Jia, and P. Liang (2018) Know what you don’t know: unanswerable questions for squad. In ACL, Cited by: Table 1, §2, §4.2.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100, 000+ questions for machine comprehension of text. In EMNLP, Cited by: §1, §2, §4.2.
  • M. Sap, H. Rashkin, D. Chen, R. L. Bras, and Y. Choi (2019) Social iqa: commonsense reasoning about social interactions. In EMNLP 2019, Cited by: §2, §4.2.
  • M. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi (2016) Bidirectional attention flow for machine comprehension. ArXiv abs/1611.01603. Cited by: §1.
  • Y. Sun, G. Cheng, and Y. Qu (2018) Reading comprehension with graph-based temporal-casual reasoning. In COLING, Cited by: §2.
  • A. Talmor and J. Berant (2018) The web as a knowledge-base for answering complex questions. In NAACL-HLT, Cited by: §1.
  • A. Talmor, J. Herzig, N. Lourie, and J. Berant (2018) CommonsenseQA: a question answering challenge targeting commonsense knowledge. In NAACL-HLT, Cited by: Table 1, §2, §4.2.
  • A. Trischler, T. Wang, X. Yuan, J. Harris, A. Sordoni, P. Bachman, and K. Suleman (2016) NewsQA: a machine comprehension dataset. In Rep4NLP@ACL, Cited by: Table 1, §1, §2.
  • J. Welbl, P. Stenetorp, and S. Riedel (2017) Constructing datasets for multi-hop reading comprehension across documents. Transactions of the Association for Computational Linguistics 6, pp. 287–302. Cited by: §1.
  • Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. W. Cohen, R. Salakhutdinov, and C. D. Manning (2018) HotpotQA: a dataset for diverse, explainable multi-hop question answering. In EMNLP, Cited by: Table 1, §1, §1, §2, §4.2.
  • B. Zhou, D. Khashabi, Q. Ning, and D. Roth (2019) ”Going on a vacation” takes longer than ”going for a walk”: a study of temporal commonsense understanding. In EMNLP/IJCNLP, Cited by: Table 1, §2, §4.2.