GeoSQA: A Benchmark for Scenario-based Question Answering in the Geography Domain at High School Level

August 20, 2019 ∙ Zixian Huang et al., Nanjing University

Scenario-based question answering (SQA) has attracted increasing research attention. It typically requires retrieving and integrating knowledge from multiple sources, and applying general knowledge to a specific case described by a scenario. SQA widely exists in the medical, geography, and legal domains, both in practice and in exams. In this paper, we introduce the GeoSQA dataset. It consists of 1,981 scenarios and 4,110 multiple-choice questions in the geography domain at high school level, where diagrams (e.g., maps, charts) have been manually annotated with natural language descriptions to benefit NLP research. Benchmark results on a variety of state-of-the-art methods for question answering, textual entailment, and reading comprehension demonstrate the unique challenges presented by SQA for future research.


1 Introduction

Scenario-based question answering (SQA) is an emerging application of NLP Lally et al. (2017). Unlike traditional QA, a question in SQA is accompanied by a scenario, e.g., a patient summary in the medical domain asking for a diagnosis or treatment. A scenario differs from the document given in a reading comprehension task, where the answer can be extracted or abstracted from the document Rajpurkar et al. (2016); Nguyen et al. (2016); Lai et al. (2017). SQA requires retrieving and integrating knowledge from multiple sources, and applying general knowledge to the specific case described by the scenario.

SQA has found application in many fields, especially in the legal domain Ye et al. (2018); Luo et al. (2017); Zhong et al. (2018) and in high-school geography exams Ding et al. (2018); Zhang et al. (2018). The latter is particularly challenging because a geographical scenario consists of both text and diagrams (e.g., maps, charts). Questions cover topics such as city planning, climate, agricultural planning, and transportation. An example of a scenario and a question is presented in Figure 1.

Geographical SQA poses great challenges to NLP and related research, ranging from scenario understanding to cross-modal knowledge integration and reasoning. However, large datasets and benchmarking efforts for this task are lacking. In this paper, we introduce GeoSQA, an SQA dataset in the geography domain consisting of 1,981 scenarios and 4,110 multiple-choice questions at high school level. In particular, each diagram has been manually annotated with a high-quality natural language description of its content, as illustrated in Figure 1. This labor-intensive effort significantly extends the use of GeoSQA, which can support visual SQA (using the diagrams), natural-language-based SQA (using the annotations of diagrams), and even diagram-to-text research. We test the effectiveness of a variety of methods for question answering, textual entailment, and reading comprehension on GeoSQA. The results demonstrate its unique challenges, which await more effective solutions.

The remainder of the paper is organized as follows. Section 2 discusses related work. Section 3 describes the GeoSQA dataset. Section 4 reports benchmark results. Section 5 concludes the paper.

Figure 1: An example of a scenario, a question, and diagram annotations.

2 Related Work

2.1 Scenario-based Question Answering

Scenario-based question answering (SQA) was introduced by Lally et al. (2017), who present the WatsonPaths system for answering questions that describe a medical scenario about a patient and ask for a diagnosis or treatment. SQA also finds application in the legal domain, where a legal case describes a scenario to be decided Ye et al. (2018); Luo et al. (2017); Zhong et al. (2018).

For some domains, reasoning with domain knowledge is essential to SQA, so such questions often appear in exams such as the Gaokao, China's national college entrance examination. For example, in the geography domain, Ding et al. (2018) and Zhang et al. (2018) construct a knowledge graph to support answering scenario-based geography questions at high school level.

2.2 Related Datasets

There are many datasets for traditional QA, such as WebQuestions Berant et al. (2013) and WikiQA Yang et al. (2015). A closely related task is reading comprehension, where the answer to a question is extracted or abstracted from a given document Rajpurkar et al. (2016); Nguyen et al. (2016); Lai et al. (2017). By comparison, SQA is arguably more difficult because a scenario is present and contextualizes a question, but no direct answer can be identified from the scenario.

The GeoSQA dataset introduced in this paper is not the first resource for geographical SQA. Ding et al. (2018) and Zhang et al. (2018) have made their datasets public. However, compared with GeoSQA, their datasets are small and, more importantly, they ignore diagrams, which represent a unique challenge in geographical SQA. By contrast, diagrams are included in GeoSQA for completeness and have been manually annotated with natural language descriptions for extended use, including but not limited to NLP research.

Existing SQA datasets for other domains include the TREC Precision Medicine track Roberts et al. (2018) for the medical domain, and CAIL Xiao et al. (2018) for the legal domain. However, SQA in the geography domain requires different forms of knowledge and different reasoning capabilities, and has posed different research challenges.

3 The GeoSQA Dataset

GeoSQA is an SQA dataset in the geography domain, containing 1,981 scenarios and 4,110 multiple-choice questions at high school level. A scenario consists of a piece of text and a diagram, and supports 1–5 questions. Each diagram is annotated with a natural language description of its content. Each question has four options, exactly one of which is the correct answer. The dataset is available online at ws.nju.edu.cn/gaokao/geosqa/1.0/.
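For illustration, a single GeoSQA scenario with one of its questions might be represented as follows. The field names, values, and file format here are hypothetical, chosen only to make the structure concrete; the actual schema is defined by the released dataset.

```python
# A hypothetical GeoSQA record, for illustration only.
example_scenario = {
    "scenario_text": "The diagram shows the monthly precipitation of city A ...",
    "diagram_image": "diagrams/00123.png",
    "diagram_annotation": "The bar chart shows that precipitation peaks in July ...",
    "questions": [
        {
            "question": "Which climate type does city A most likely have?",
            "options": {
                "A": "Temperate monsoon climate",
                "B": "Mediterranean climate",
                "C": "Tropical rainforest climate",
                "D": "Temperate continental climate",
            },
            "answer": "A",  # exactly one option is correct
        }
    ],
}
```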

3.1 Data Collection and Deduplication

We crawled over 6,000 scenarios and 13,000 questions from Gaokao and mock tests that are available on the Web. However, some scenarios are just copies or trivial variants of others. There is a need to clean and deduplicate the collected data.

Method. The problem is to decide whether a pair of scenarios are (near) duplicates or not.

We first establish a matching between their structures. The matching consists of six pairs of text elements: one pair of scenario texts, one pair of their most similar questions, and four pairs of the most similar options of those two questions. Text similarity is computed as the cosine similarity between two bags of words.
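A minimal sketch of this bag-of-words cosine similarity, and of selecting the most similar pair of elements, is shown below. Whitespace tokenization is used as a placeholder; Chinese text would be word-segmented first (e.g., with a tool such as jieba).

```python
from collections import Counter
import math

def cosine_similarity(text_a, text_b, tokenize=str.split):
    """Cosine similarity between two bag-of-words vectors."""
    bag_a, bag_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(bag_a[w] * bag_b[w] for w in set(bag_a) & set(bag_b))
    norm_a = math.sqrt(sum(c * c for c in bag_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bag_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def most_similar_pair(items_a, items_b):
    """Return the pair (one item from each list) with the highest similarity,
    e.g., the most similar questions of two scenarios."""
    return max(
        ((a, b) for a in items_a for b in items_b),
        key=lambda pair: cosine_similarity(*pair),
    )
```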

Then we extend a popular text matching method, MatchPyramid Pang et al. (2016), to classify a pair of scenarios as duplicates or non-duplicates. The original MatchPyramid processes only a single pair of texts; we extend it to process all six pairs of text by concatenating their feature vectors inside the model.
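The paper does not release implementation details of this extension, so the following PyTorch sketch only illustrates the idea under stated assumptions: each of the six text pairs is turned into a word-level matching matrix, a shared MatchPyramid-style CNN encodes each matrix, and the six pooled feature vectors are concatenated before a binary classifier. The class name and hyperparameters are hypothetical.

```python
import torch
import torch.nn as nn

class MultiPairMatchPyramid(nn.Module):
    """Sketch of a MatchPyramid-style classifier extended to six text pairs.

    Each pair is represented by a word-level matching matrix (e.g., cosine
    similarities between word embeddings). A shared CNN encodes each matrix,
    adaptive pooling handles variable matrix sizes, and the six pooled
    feature vectors are concatenated before the final classifier.
    """

    def __init__(self, num_pairs=6, channels=8, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveMaxPool2d((4, 4)),  # dynamic pooling to a fixed 4x4 grid
            nn.Flatten(),
        )
        self.classifier = nn.Sequential(
            nn.Linear(num_pairs * channels * 4 * 4, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),  # duplicate vs. non-duplicate
        )

    def forward(self, matching_matrices):
        # matching_matrices: (batch, num_pairs, height, width)
        features = [
            self.encoder(matching_matrices[:, i : i + 1])  # keep a channel dim of 1
            for i in range(matching_matrices.size(1))
        ]
        return self.classifier(torch.cat(features, dim=1))
```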

Experiments. To evaluate our method, we manually label 1,000 pairs of scenarios in which positive and negative examples are balanced. The set is divided into training, validation, and test sets with a 60-20-20 split. Our method achieves an accuracy of 95.3% on the test set, indicating satisfactory performance.

Then we apply our method to the entire dataset. We index all the scenarios using Apache Lucene. For each scenario, we retrieve the 10 top-ranked scenarios as suspected duplicates. Each candidate pair is then classified by our method, trained on all 1,000 labeled examples.
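As a stand-in for the Lucene index (the paper gives no implementation details beyond using Lucene), the following sketch retrieves candidate duplicates with TF-IDF and cosine similarity; it assumes the scenario texts are already word-segmented and space-joined.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def candidate_duplicate_pairs(scenario_texts, top_k=10):
    """For each scenario, return the indices of its top_k most similar
    scenarios as suspected duplicates. TF-IDF retrieval is a simple
    approximation of the Lucene index used in the paper."""
    tfidf = TfidfVectorizer().fit_transform(scenario_texts)
    similarities = cosine_similarity(tfidf)
    np.fill_diagonal(similarities, -1.0)  # a scenario is not its own candidate
    return np.argsort(-similarities, axis=1)[:, :top_k]
```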

To verify the quality of the final results, we randomly sample and manually check 100 pairs of scenarios predicted to be duplicates; all of them are decided correctly. We also randomly sample 50 scenarios and, for each of them, retrieve and manually check the 10 top-ranked scenarios predicted to be non-duplicates. Only 6% are decided incorrectly, suggesting a low degree of redundancy in our data.

3.2 Diagram Annotation

Crawled diagrams are images. To extend the use of GeoSQA and to better support NLP research, we manually annotate each diagram with a high-quality natural language description of its content so that NLP researchers can use these text annotations instead of the original diagrams.

Annotation. We recruited 30 undergraduate students from one of the top-ranked universities in China as annotators. All of them had an excellent record in geography during high school.

Each diagram is assigned to one annotator, who also has access to the scenario text and related questions. The annotator first categorizes the diagram according to a hierarchy of categories. Each category is associated with a set of text templates that are recommended for use in annotations whenever possible. However, the annotator is free to use any form of text to annotate information not covered by the provided templates.

Annotations are required to precisely reflect the content of the diagram. All the information related to every supported question and every option should be annotated. On the other hand, inferring new knowledge via human reasoning is prohibited.

Note that the entire annotation process is designed to be iterative. The 22 diagram categories and 81 text templates are not predefined but incrementally induced during annotation. However, 11% of the diagrams are judged not to belong to any category, and no templates are provided for their annotations.

An example of annotations is shown in Figure 1.

Audit. To ensure the quality of the annotations, we recruited 3 senior annotators to audit the results. Each diagram is audited by one senior annotator, who rates the annotations along three dimensions, each on a scale of 1–5.

  • Sufficiency: The annotations cover all the necessary information in the diagram that is useful for answering related questions.

  • Fairness: The annotations are not biased towards any particular option of a question.

  • Objectiveness: The annotations are plain descriptions of the diagram—not influenced by human reasoning.

Scenarios whose diagram annotations are rated below 3 in any dimension are excluded from the dataset.

4 Benchmark Results

We tested several state-of-the-art methods for question answering, textual entailment, and reading comprehension on our GeoSQA dataset.

4.1 Corpora

We use two corpora as background knowledge. Textbooks contains 15K sentences extracted from two high-school geography textbooks. Wikipedia contains 1M articles in the latest Chinese edition of Wikipedia. We index their sentences using Apache Lucene.

4.2 Methods

We tested two text matching methods, IR and PMI Clark et al. (2016). In IR, for each option, we use the concatenation of the scenario text, the question, and the option as a query to retrieve the top-ranked sentence from a corpus, and we take the retrieval score of this sentence as the score of the option. In PMI, for each option, we compute the Pointwise Mutual Information (PMI) between the question and the option as the score of the option, with probabilities estimated from a corpus. In both methods, we choose the option with the highest score as the answer.
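As an illustration of the PMI baseline, below is a minimal Python sketch that estimates PMI at the sentence level with add-one smoothing. The windowing and smoothing choices of the original solver in Clark et al. (2016) may differ, and all inputs are assumed to be pre-tokenized word lists (e.g., segmented Chinese).

```python
import math

def pmi_score(question_words, option_words, corpus_sentences):
    """Average PMI between question words and option words, with
    co-occurrence counted at the sentence level and add-one smoothing."""
    sentences = [set(sentence) for sentence in corpus_sentences]
    n = len(sentences)
    scores = []
    for q in question_words:
        for o in option_words:
            count_q = sum(1 for s in sentences if q in s)
            count_o = sum(1 for s in sentences if o in s)
            count_qo = sum(1 for s in sentences if q in s and o in s)
            # PMI(q, o) = log( P(q, o) / (P(q) * P(o)) )
            p_q = (count_q + 1) / (n + 1)
            p_o = (count_o + 1) / (n + 1)
            p_qo = (count_qo + 1) / (n + 1)
            scores.append(math.log(p_qo / (p_q * p_o)))
    return sum(scores) / len(scores) if scores else 0.0

def answer_by_pmi(question_words, options, corpus_sentences):
    """Choose the option label with the highest average PMI score.
    `options` maps labels (e.g., "A") to lists of option words."""
    return max(options,
               key=lambda label: pmi_score(question_words, options[label], corpus_sentences))
```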

We tested four textual entailment methods: ESIM Chen et al. (2017), DIIN Gong et al. (2018), BERT Devlin et al. (2018), and BiMPM Wang et al. (2017). The first three methods were trained on the XNLI dataset Conneau et al. (2018); the last was trained on the LCQMC dataset Liu et al. (2018). For each option, a textual entailment method retrieves six top-ranked sentences from a corpus to form the entailing text, following the retrieval procedure of the IR method described above. Depending on the configuration, the scenario text and diagram annotations may or may not be included in the entailing text. The concatenation of the question and the option forms the entailed text. Finally, we choose the option with the highest entailment score as the answer.
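The option-scoring loop can be sketched as follows. `retrieve_top_sentences` and `entailment_score` are hypothetical placeholders for the Lucene retriever and the trained entailment model (e.g., the entailment probability of a BERT NLI classifier); the paper does not specify these interfaces.

```python
def answer_by_entailment(scenario_text, diagram_annotation, question, options,
                         retrieve_top_sentences, entailment_score,
                         use_scenario=True, k=6):
    """Choose the option whose (retrieved evidence -> question + option)
    entailment score is highest. `options` maps labels to option texts."""
    best_option, best_score = None, float("-inf")
    for label, option_text in options.items():
        query = " ".join([scenario_text, question, option_text])
        evidence = retrieve_top_sentences(query, k)   # list of k sentences
        premise = " ".join(evidence)
        if use_scenario:                              # the "w/ scenario" setting
            premise = " ".join([scenario_text, diagram_annotation, premise])
        hypothesis = " ".join([question, option_text])
        score = entailment_score(premise, hypothesis)
        if score > best_score:
            best_option, best_score = label, score
    return best_option
```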

We tested one reading comprehension method: BERT Devlin et al. (2018), trained on the DuReader dataset He et al. (2018). For each option, the reading comprehension method retrieves six top-ranked sentences from a corpus as part of the passage, following the retrieval procedure of the IR method described above. Depending on the configuration, the scenario text and diagram annotations may or may not be included in the passage. The reading comprehension method then extracts a text span from the passage, and we choose the option that is most similar to the extracted span as the answer. Similarity is computed as the cosine similarity between two bags of words.
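Below is a hedged sketch of this pipeline. `retrieve_top_sentences` and `extract_answer_span` are hypothetical placeholders for the Lucene retriever and the BERT reading-comprehension model, and whitespace tokenization stands in for Chinese word segmentation.

```python
from collections import Counter
import math

def bow_cosine(text_a, text_b, tokenize=str.split):
    """Cosine similarity between bag-of-words vectors."""
    a, b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def answer_by_reading_comprehension(scenario_text, diagram_annotation, question,
                                    options, retrieve_top_sentences,
                                    extract_answer_span, use_scenario=True, k=6):
    """Build a passage from retrieved sentences (optionally prepending the
    scenario text and diagram annotation), extract a span with the RC model,
    and return the option most similar to the extracted span."""
    scores = {}
    for label, option_text in options.items():
        query = " ".join([scenario_text, question, option_text])
        passage = " ".join(retrieve_top_sentences(query, k))
        if use_scenario:
            passage = " ".join([scenario_text, diagram_annotation, passage])
        span = extract_answer_span(question, passage)
        scores[label] = bow_cosine(option_text, span)
    return max(scores, key=scores.get)
```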

4.3 Results

The results are summarized in Table 1. Note that each question has four options, so even random guessing would yield an expected accuracy of 25%.

Almost all the methods performed similarly to random guessing, showing that SQA on our dataset poses unique challenges.

Method                          Textbooks  Wikipedia
IR                                  25.24      25.14
PMI                                 26.22      25.19
ESIM w/o scenario                   25.85      25.41
ESIM w/ scenario                    24.34      24.41
DIIN w/o scenario                   24.15      25.20
DIIN w/ scenario                    25.11      24.89
BERT (entailment) w/o scenario      24.29      24.17
BERT (entailment) w/ scenario       24.97      24.68
BiMPM w/o scenario                  24.13      24.51
BiMPM w/ scenario                   24.76      23.81
BERT (RC) w/o scenario              24.81      24.78
BERT (RC) w/ scenario               23.66      23.01

Table 1: Proportions (%) of correctly answered questions on the Textbooks and Wikipedia corpora.

4.4 Discussion

To explain the poor performance of existing methods, we have identified the following challenges. First, SQA relies on domain knowledge that is not provided in the scenario. However, relevant knowledge may fail to be retrieved from the corpus. Second, for some questions, commonsense knowledge is needed but is not included in textbooks and may fail to be retrieved from Wikipedia. Third, the retrieved general knowledge needs to be applied to the specific case described by a scenario. Existing QA and reading comprehension methods hardly have this capability.

5 Conclusion

We have contributed GeoSQA, a large SQA dataset in which diagrams are present and have been manually annotated with natural language descriptions. We have tested a variety of existing methods on the dataset. Their results are unsatisfactory, demonstrating the unique challenges that SQA on our dataset presents. In future work, we will pursue more effective solutions to these challenges.

Researchers are invited to use GeoSQA to support their own tasks, including but not limited to natural language based SQA, visual SQA, and the diagram-to-text task.

Acknowledgments

This work was supported in part by the National Key R&D Program of China under Grant 2018YFB1005100, and in part by the NSFC under Grant 61572247. Gong Cheng was supported by the Qing Lan Program of Jiangsu Province. We would like to thank all the annotators.

References

  • J. Berant, A. Chou, R. Frostig, and P. Liang (2013) Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, Seattle, Washington, USA, October 18-21, 2013, pp. 1533–1544. Cited by: §2.2.
  • Q. Chen, X. Zhu, Z. Ling, S. Wei, H. Jiang, and D. Inkpen (2017) Enhanced LSTM for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pp. 1657–1668. External Links: Document Cited by: §4.2.
  • P. Clark, O. Etzioni, T. Khot, A. Sabharwal, O. Tafjord, P. D. Turney, and D. Khashabi (2016) Combining retrieval, statistics, and inference to answer elementary science questions. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, pp. 2580–2586. Cited by: §4.2.
  • A. Conneau, R. Rinott, G. Lample, A. Williams, S. R. Bowman, H. Schwenk, and V. Stoyanov (2018) XNLI: evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pp. 2475–2485. Cited by: §4.2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805. Cited by: §4.2.
  • J. Ding, Y. Wang, W. Hu, L. Shi, and Y. Qu (2018) Answering multiple-choice questions in geographical Gaokao with a concept graph. In The Semantic Web - 15th International Conference, ESWC 2018, Heraklion, Crete, Greece, June 3-7, 2018, Proceedings, pp. 161–176. External Links: Document Cited by: §1, §2.1, §2.2.
  • Y. Gong, H. Luo, and J. Zhang (2018) Natural language inference over interaction space. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. Cited by: §4.2.
  • W. He, K. Liu, J. Liu, Y. Lyu, S. Zhao, X. Xiao, Y. Liu, Y. Wang, H. Wu, Q. She, X. Liu, T. Wu, and H. Wang (2018) DuReader: a Chinese machine reading comprehension dataset from real-world applications. In Proceedings of the Workshop on Machine Reading for Question Answering@ACL 2018, Melbourne, Australia, July 19, 2018, pp. 37–46. Cited by: §4.2.
  • G. Lai, Q. Xie, H. Liu, Y. Yang, and E. H. Hovy (2017) RACE: large-scale reading comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pp. 785–794. Cited by: §1, §2.2.
  • A. Lally, S. Bagchi, M. Barborak, D. W. Buchanan, J. Chu-Carroll, D. A. Ferrucci, M. R. Glass, A. Kalyanpur, E. T. Mueller, J. W. Murdock, S. Patwardhan, and J. M. Prager (2017) WatsonPaths: scenario-based question answering and inference over unstructured information. AI Magazine 38 (2), pp. 59–76. Cited by: §1, §2.1.
  • X. Liu, Q. Chen, C. Deng, H. Zeng, J. Chen, D. Li, and B. Tang (2018) LCQMC: A large-scale Chinese question matching corpus. In Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, Santa Fe, New Mexico, USA, August 20-26, 2018, pp. 1952–1962. Cited by: §4.2.
  • B. Luo, Y. Feng, J. Xu, X. Zhang, and D. Zhao (2017) Learning to predict charges for criminal cases with legal basis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pp. 2727–2736. Cited by: §1, §2.1.
  • T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, and L. Deng (2016) MS MARCO: A human generated machine reading comprehension dataset. In Proceedings of the Workshop on Cognitive Computation: Integrating Neural and Symbolic Approaches 2016, co-located with the 30th Annual Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain, December 9, 2016. Cited by: §1, §2.2.
  • L. Pang, Y. Lan, J. Guo, J. Xu, S. Wan, and X. Cheng (2016) Text matching as image recognition. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA., pp. 2793–2799. Cited by: §3.1.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pp. 2383–2392. Cited by: §1, §2.2.
  • K. Roberts, D. Demner-Fushman, E. M. Voorhees, W. R. Hersh, S. Bedrick, and A. J. Lazar (2018) Overview of the TREC 2018 Precision Medicine track. In Proceedings of The Twenty-Seventh Text REtrieval Conference, TREC 2018, Gaithersburg, Maryland, USA, November 14-16, 2018. Cited by: §2.2.
  • Z. Wang, W. Hamza, and R. Florian (2017) Bilateral multi-perspective matching for natural language sentences. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017, pp. 4144–4150. External Links: Document Cited by: §4.2.
  • C. Xiao, H. Zhong, Z. Guo, C. Tu, Z. Liu, M. Sun, Y. Feng, X. Han, Z. Hu, H. Wang, and J. Xu (2018) CAIL2018: A large-scale legal dataset for judgment prediction. CoRR abs/1807.02478. Cited by: §2.2.
  • Y. Yang, W. Yih, and C. Meek (2015) WikiQA: A challenge dataset for open-domain question answering. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pp. 2013–2018. Cited by: §2.2.
  • H. Ye, X. Jiang, Z. Luo, and W. Chao (2018) Interpretable charge predictions for criminal cases: learning to generate court views from fact descriptions. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pp. 1854–1864. Cited by: §1, §2.1.
  • Z. Zhang, L. Zhang, H. Zhang, W. He, Z. Sun, G. Cheng, Q. Liu, X. Dai, and Y. Qu (2018) Towards answering geography questions in Gaokao: a hybrid approach. In Proceedings of the Third China Conference on Knowledge Graph and Semantic Computing, CCKS 2018, Tianjin, China, 14-17 August 2018, pp. 1–13. External Links: Document Cited by: §1, §2.1, §2.2.
  • H. Zhong, Z. Guo, C. Tu, C. Xiao, Z. Liu, and M. Sun (2018) Legal judgment prediction via topological learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pp. 3540–3549. Cited by: §1, §2.1.