Repartitioning of the ComplexWebQuestions Dataset

07/25/2018 ∙ by Alon Talmor, et al. ∙ Tel Aviv University

Recently, Talmor and Berant (2018) introduced ComplexWebQuestions, a dataset focused on answering complex questions by decomposing them into a sequence of simpler questions and extracting the answer from retrieved web snippets. In their work, the authors used a pre-trained reading comprehension (RC) model (Salant and Berant, 2018) to extract the answer from the web snippets. In this short note we show that training an RC model directly on the training data of ComplexWebQuestions reveals a leakage from the training set to the test set that makes unreasonably high performance attainable. As a solution, we construct a new partitioning of ComplexWebQuestions that does not suffer from this leakage and publicly release it. We also perform an empirical evaluation on both partitionings and show that training an RC model on the training data substantially improves state-of-the-art performance.




1 ComplexWebQuestions Dataset

ComplexWebQuestions is a recently introduced question answering (QA) dataset (Talmor and Berant, 2018). Each example in ComplexWebQuestions is a triple consisting of a question, the correct answer (with aliases), and a document containing a list of web snippets retrieved by a base model while attempting to answer the question. The dataset can be used either by interacting with a search engine to find web snippets or by using the pre-retrieved snippets. Table 1 provides a few example questions from the dataset.

“What films star Taylor Lautner and have costume designs by Nina Proctor?”
“Which school that Sir Ernest Rutherford attended has the latest founding date?”
“Which of the countries bordering Mexico have an army size of less than 1050?”
“Where is the end of the river that originates in Shannon Pot?”
Table 1: Example questions from ComplexWebQuestions.

ComplexWebQuestions was created by taking examples from the WebQuestionsSP dataset (Yih et al., 2016), which contains 4,737 questions paired with SPARQL queries for Freebase (Bollacker et al., 2008). In WebQuestions, questions are broad but simple. Thus, the authors sampled question-query pairs, automatically created more complex SPARQL queries using manually defined rules, automatically generated questions that are understandable to Amazon Mechanical Turk workers, and then had the workers paraphrase them into natural language (similar to Wang et al., 2015). They computed answers by executing the complex SPARQL queries against Freebase, and thus obtained questions that are both broad and complex. Figure 1 provides an overview of this procedure.

Figure 1: Overview of data collection procedure. Blue text denotes different stages of the term addition, green represents the obj value, and red the intermediate text to connect the new term and seed question.

2 Partitioning Issue

Complex Question | Answer
“Who was the president in 1980 of the nation that uses the Pakistani repuu as money?” | “Muhammad Zia-ul-Haq”
“The country that contains Balochistan, Pakistan had what President in 1980?” | “Muhammad Zia-ul-Haq”
“Who held the office of president in 1980 in the country that has Islamabad as its capital?” | “Muhammad Zia-ul-Haq”
Table 2: Example of a seed question “Who was the president of Pakistan in 1980?” and three questions that were generated from it.

The original ComplexWebQuestions was created by generating 34,689 examples and then randomly partitioning them into disjoint training, development and test sets. Because every seed question from WebQuestions produces multiple questions in ComplexWebQuestions, questions that originate from the same seed question will be placed in both the training set and test set with high probability. Consider the example in Table 2 of a seed question and three questions that were generated from it.

As can be seen in this example, if one of the generated questions is placed in the training set and another in the test set, the model can learn a spurious correlation that maps the terms “1980” and “Pakistan” to “Muhammad Zia-ul-Haq”. Naturally, this is not desired behavior.
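To make the overlap concrete, one could count how many test examples share a seed question with at least one training example. The sketch below is only illustrative: the `seed_id` field, the toy data (100 seeds with 5 generated questions each) and the 80/20 split are assumptions, not the format or proportions of the released dataset.

```python
import random

def seed_overlap_rate(train, test):
    """Fraction of test examples whose seed question also occurs in train."""
    if not test:
        return 0.0
    train_seeds = {ex["seed_id"] for ex in train}
    shared = sum(1 for ex in test if ex["seed_id"] in train_seeds)
    return shared / len(test)

# Toy data (hypothetical): 100 seed questions, 5 generated questions each.
examples = [
    {"seed_id": s, "question": f"q{s}-{i}"}
    for s in range(100)
    for i in range(5)
]

# Random example-level partitioning, ignoring seed questions.
rng = random.Random(0)
rng.shuffle(examples)
cut = int(0.8 * len(examples))
random_train, random_test = examples[:cut], examples[cut:]

overlap = seed_overlap_rate(random_train, random_test)
print(f"test examples with a seed seen in training: {overlap:.2%}")
```

With several generated questions per seed, a random example-level split leaves most test examples with a sibling in the training set, which is the situation described above.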

The model presented by Talmor and Berant (2018) did not train an RC model on ComplexWebQuestions and instead used a model that was pre-trained on the SQuAD dataset (Rajpurkar et al., 2016). Thus, their model did not learn any of the spurious correlations mentioned above. In this work, we train an RC model directly on ComplexWebQuestions and find that the model indeed learns these correlations, and thus spurious information is leaked from the training set to the test set.

The solution is simple: we repartition the ComplexWebQuestions dataset based on the seed questions, that is, we randomly split the original WebQuestions questions into a training, development and test set. Thus, no question in the test set (or development set) originates from the same seed question as a question from the training set. We train an RC model on this new dataset and show that its performance is substantially lower compared to the original partitioning. Nevertheless, training an RC model on ComplexWebQuestions improves performance compared to the pre-trained model, and thus we establish a new state of the art on ComplexWebQuestions.
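A seed-based split of this kind can be sketched as follows. This is a minimal illustration, not the script used to build version 1.1: the `seed_id` field, the 10% development and test fractions, and the toy data are all assumptions.

```python
import random
from collections import defaultdict

def split_by_seed(examples, dev_frac=0.1, test_frac=0.1, rng_seed=0):
    """Assign whole seed-question groups to train/dev/test.

    The fractions are hypothetical defaults, not the dataset's proportions.
    """
    by_seed = defaultdict(list)
    for ex in examples:
        by_seed[ex["seed_id"]].append(ex)

    # Shuffle the seed questions, not the generated examples.
    seed_ids = sorted(by_seed)
    random.Random(rng_seed).shuffle(seed_ids)

    n_dev = int(dev_frac * len(seed_ids))
    n_test = int(test_frac * len(seed_ids))
    assignment = {
        "dev": seed_ids[:n_dev],
        "test": seed_ids[n_dev:n_dev + n_test],
        "train": seed_ids[n_dev + n_test:],
    }
    return {
        name: [ex for sid in ids for ex in by_seed[sid]]
        for name, ids in assignment.items()
    }

# Toy data (hypothetical): 100 seed questions, 5 generated questions each.
examples = [
    {"seed_id": s, "question": f"q{s}-{i}"}
    for s in range(100)
    for i in range(5)
]
splits = split_by_seed(examples)
seeds = {name: {ex["seed_id"] for ex in part} for name, part in splits.items()}
print({name: len(part) for name, part in splits.items()})
```

Because the shuffle operates on seed identifiers, all questions generated from one seed land in the same partition, so the three seed sets are disjoint by construction.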

We name the new dataset ComplexWebQuestions version 1.1 (and the old dataset version 1.0), and it can be downloaded from

3 Experiments

In this section we show that training an RC model on ComplexWebQuestions results in high performance on ComplexWebQuestions version 1.0 and much lower performance on ComplexWebQuestions version 1.1. As a side effect, we also report experiments on different ways of training the RC model on ComplexWebQuestions.

Experimental setup

The QA model of Talmor and Berant (2018) has two parts. First, a question decomposition model determines how to decompose the complex question into a sequence of simpler questions. Second, each question is sent to a search engine, web snippets are retrieved, and an RC model extracts the answer from the snippets. Thus, the model is specified by (i) a question decomposition procedure and (ii) an RC model. Talmor and Berant (2018) propose two decomposition procedures, which we also evaluate:

  1. SimpQA: A question is not decomposed but sent as is to the search engine.

  2. SplitQA: A question decomposition model decomposes the question and then re-composes the final answer.
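The control flow of the two procedures can be sketched roughly as below. Every callable here (`search`, `reader`, `decompose`, `recompose`) is a hypothetical stand-in rather than the actual components, and the sub-questions are answered independently, which is a simplification made for illustration.

```python
from typing import Callable, List

Search = Callable[[str], List[str]]        # question -> web snippets
Reader = Callable[[str, List[str]], str]   # (question, snippets) -> answer

def simp_qa(question: str, search: Search, reader: Reader) -> str:
    """SimpQA: send the question to the search engine as is."""
    return reader(question, search(question))

def split_qa(
    question: str,
    decompose: Callable[[str], List[str]],
    recompose: Callable[[str, List[str]], str],
    search: Search,
    reader: Reader,
) -> str:
    """SplitQA: decompose, answer each simpler question, re-compose."""
    sub_questions = decompose(question)
    sub_answers = [reader(q, search(q)) for q in sub_questions]
    return recompose(question, sub_answers)

# Toy stand-ins (hypothetical), only to exercise the control flow.
def toy_search(q: str) -> List[str]:
    return [f"snippet for: {q}"]

def toy_reader(q: str, snippets: List[str]) -> str:
    return snippets[0].upper()

def toy_decompose(q: str) -> List[str]:
    return q.split(" and ")

def toy_recompose(q: str, answers: List[str]) -> str:
    return " & ".join(answers)

simp_answer = simp_qa("a and b", toy_search, toy_reader)
split_answer = split_qa("a and b", toy_decompose, toy_recompose, toy_search, toy_reader)
print(simp_answer)
print(split_answer)
```

Keeping the RC model behind a single `reader` interface mirrors the setup described here, where the same decomposition procedure can be paired with different RC models.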

Talmor and Berant (2018) used a single RC model, pre-trained on SQuAD. Here, we evaluate three RC models:

  1. Pretrained: The same model as in Talmor and Berant (2018).

  2. NoDecomp: We train the DocumentQA RC model (Clark and Gardner, 2017) on all question-answer pairs from ComplexWebQuestions, where for each question we provide the web snippets that Google returns for the entire question. In total, DocumentQA is trained on 24,649 examples.

  3. Decomp: We train DocumentQA on all examples from NoDecomp. In addition, owing to the generation process of ComplexWebQuestions, Talmor and Berant (2018) presented a method for heuristically decomposing complex questions into two simpler questions for which the answer can be computed from Freebase. Thus, we can train DocumentQA not only on the original complex questions, but also on the decomposed questions, by sending them to a search engine and retrieving the snippets. This is important for SplitQA, which applies the RC model to simple rather than complex questions. In total, we train DocumentQA on 63,263 examples, derived from the ComplexWebQuestions training set.

For evaluation, we measure precision@1, the official evaluation metric of ComplexWebQuestions. We now present empirical results for various combinations of a question decomposition model and an RC model.
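As a rough illustration of the metric, the sketch below counts an example as correct when the top-ranked prediction matches the gold answer or one of its aliases. The lowercase and whitespace normalization, the function names and the toy data are assumptions; the official evaluation script may differ in its matching rules.

```python
def normalize(text):
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())

def precision_at_1(ranked_predictions, gold_answers):
    """Percentage of examples whose top-ranked answer matches a gold answer.

    ranked_predictions: one ranked list of answer strings per example.
    gold_answers: one list of acceptable strings (answer plus aliases) per example.
    """
    if not ranked_predictions:
        return 0.0
    hits = 0
    for ranked, gold in zip(ranked_predictions, gold_answers):
        accepted = {normalize(g) for g in gold}
        if ranked and normalize(ranked[0]) in accepted:
            hits += 1
    return 100.0 * hits / len(ranked_predictions)

# Toy example (hypothetical predictions and aliases).
predictions = [
    ["muhammad  zia-ul-haq", "some other answer"],  # top answer matches after normalization
    ["some other answer", "Muhammad Zia-ul-Haq"],   # correct answer only at rank 2
]
gold = [
    ["Muhammad Zia-ul-Haq", "Zia-ul-Haq"],
    ["Muhammad Zia-ul-Haq", "Zia-ul-Haq"],
]
score = precision_at_1(predictions, gold)
print(score)
```

Only the first entry of each ranked list is inspected, so a correct answer at rank 2 earns no credit.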

3.1 Results

System Dev. Test
SimpQA+Pretrained 20.4 20.8
SplitQA+Pretrained 29.0 27.5
SimpQA+NoDecomp 47.8 -
SplitQA+NoDecomp 55.0 -

Table 3: precision@1 results on the development set and test set for ComplexWebQuestions version 1.0

System Dev. Test
SimpQA+Pretrained 20.5 19.9
SplitQA+Pretrained 27.6 25.9
SimpQA+NoDecomp 30.6 -
SplitQA+NoDecomp 31.1 -
SplitQA+Decomp 35.6 34.2

Table 4: precision@1 results on the development set and test set for ComplexWebQuestions version 1.1

Tables 3 and 4 present the results of our evaluation on both versions of ComplexWebQuestions and various models.

On version 1.0, SimpQA+NoDecomp and SplitQA+NoDecomp reach precision@1 of 47.8 and 55.0 on the development set, respectively, far above the pre-trained model. However, when the same models are trained and evaluated on version 1.1, results are much lower: 30.6 and 31.1, respectively. In contrast, SimpQA+Pretrained, which has not been trained on these training sets, remains unaffected by the repartitioning, with a similar precision@1 of 20.4 on version 1.0 and 20.5 on version 1.1. This demonstrates that version 1.0 did in fact enable the model to exploit spurious correlations to achieve a much higher accuracy, whereas in version 1.1 this is not the case.
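The size of the effect can be read directly off the development-set columns of Tables 3 and 4. The short calculation below simply subtracts the version 1.0 score from the version 1.1 score for each system that appears in both tables; the variable names are illustrative.

```python
# Development-set precision@1 from Tables 3 and 4: (version 1.0, version 1.1).
dev_p1 = {
    "SimpQA+Pretrained": (20.4, 20.5),
    "SplitQA+Pretrained": (29.0, 27.6),
    "SimpQA+NoDecomp": (47.8, 30.6),
    "SplitQA+NoDecomp": (55.0, 31.1),
}

# Change in points when moving from version 1.0 to version 1.1.
changes = {
    system: round(v11 - v10, 1)
    for system, (v10, v11) in dev_p1.items()
}

for system, delta in changes.items():
    print(f"{system}: {delta:+.1f} points")
```

The models trained on ComplexWebQuestions lose 17.2 and 23.9 points, while the pre-trained models move by at most 1.4 points.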

Overall, results of SimpQA+NoDecomp and SplitQA+NoDecomp are higher than those of the pre-trained RC model, showing that training an RC model on the training set improves performance. A further increase in precision@1 for SplitQA, from 31.1 to 35.6 on the development set, is achieved by training the model on the full questions as well as the decomposed questions (SplitQA+Decomp), implying that adding the decomposed questions improves performance as well. Finally, we run SplitQA+Decomp once on the test set and obtain a precision@1 of 34.2, establishing a new state of the art on ComplexWebQuestions.

4 Conclusion

In this short note we describe a problem with the partitioning of the ComplexWebQuestions dataset, which was exposed by training an RC model on this dataset. We solve this problem by repartitioning the dataset, present an empirical evaluation, and report a new state of the art on ComplexWebQuestions. The new dataset is publicly available on the ComplexWebQuestions website.


  • Bollacker et al. (2008) K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In International Conference on Management of Data (SIGMOD). pages 1247–1250.
  • Clark and Gardner (2017) C. Clark and M. Gardner. 2017. Simple and effective multi-paragraph reading comprehension. arXiv preprint arXiv:1710.10723.
  • Rajpurkar et al. (2016) P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Empirical Methods in Natural Language Processing (EMNLP).

  • Salant and Berant (2018) S. Salant and J. Berant. 2018. Contextualized word representations for reading comprehension. In North American Association for Computational Linguistics (NAACL).
  • Talmor and Berant (2018) A. Talmor and J. Berant. 2018. The web as knowledge-base for answering complex questions. In North American Association for Computational Linguistics (NAACL).
  • Wang et al. (2015) Y. Wang, J. Berant, and P. Liang. 2015. Building a semantic parser overnight. In Association for Computational Linguistics (ACL).
  • Yih et al. (2016) W. Yih, M. Richardson, C. Meek, M. Chang, and J. Suh. 2016. The value of semantic parse labeling for knowledge base question answering. In Association for Computational Linguistics (ACL).