Improving Question Answering with External Knowledge

by Xiaoman Pan et al.

Prior background knowledge is essential for human reading and understanding. In this work, we investigate how to leverage external knowledge to improve question answering. We primarily focus on multiple-choice question answering tasks that require external knowledge to answer questions. We investigate the effects of utilizing external in-domain multiple-choice question answering datasets and enriching the reference corpus by external out-domain corpora (i.e., Wikipedia articles). Experimental results demonstrate the effectiveness of external knowledge on two challenging multiple-choice question answering tasks: ARC and OpenBookQA.





1 Introduction

External knowledge plays a critical role in human reading and understanding since authors assume readers have a certain amount of background knowledge gained from sources outside the text McNamara et al. (2004); Salmerón et al. (2006); Zhang and Seepho (2013).

A growing number of studies concentrate on constructing multiple-choice machine reading comprehension Mostafazadeh et al. (2016); Lai et al. (2017); Khashabi et al. (2018); Ostermann et al. (2018); Sun et al. (2019) or question answering tasks Clark et al. (2018); Mihaylov et al. (2018). For machine reading comprehension tasks, the majority of the questions are still designed to be answerable based on the content of the provided reference documents. In this paper, we focus on multiple-choice question answering tasks in which only a reference corpus is provided and diverse types of knowledge are required to select the correct answer options Clark et al. (2018).

It is still an open problem how to exploit external knowledge for multiple-choice question answering to bridge the knowledge gap between humans and machines. Very recent studies Peters et al. (2018); Radford et al. (2018); Devlin et al. (2018) leverage rich world knowledge by pre-training deep neural models such as LSTMs and Transformers Vaswani et al. (2017); Liu et al. (2018) with language model objectives over large-scale corpora (e.g., BookCorpus Zhu et al. (2015) and Wikipedia articles). We have seen significant improvements on a wide range of natural language processing tasks by fine-tuning these pre-trained models on downstream tasks. However, it is relatively time-consuming and resource-intensive to introduce external knowledge during the pre-training stage.

In this paper, we aim to utilize external knowledge to improve multiple-choice question answering during the fine-tuning stage. We investigate the effects of 1) augmenting training data by using external in-domain question answering datasets; 2) enriching reference corpora by retrieving additional knowledge from external open-domain resources via conducting entity discovery and linking based on questions and answer options.

We conduct preliminary experiments on two challenging multiple-choice question answering tasks collected from examinations – ARC Clark et al. (2018) and OpenBookQA Mihaylov et al. (2018) – using BERT Devlin et al. (2018) as the underlying question answering model. Experimental results show that we can obtain promising results by leveraging external knowledge.

2 Method

In this section, we first introduce the underlying question answering baseline we use (Section 2.1). We then present two methods to introduce external in-domain (Section 2.2) and open-domain (Section 2.3) knowledge.

2.1 Basic Framework

By default, we employ the following framework unless explicitly specified. Following Sun et al., we first fine-tune a pre-trained language model on RACE Lai et al. (2017), the largest multiple-choice machine reading comprehension dataset, and then fine-tune the resulting model on the target multiple-choice question answering datasets. In this paper, we use BERT Devlin et al. (2018) as the pre-trained language model.

Given a question, an answer option, and a reference document, we concatenate them with the special BERT tokens [CLS] and [SEP] to form the input sequence. We add segmentation embedding A to every token before the first [SEP] (exclusive) and B to the other tokens. For instances in ARC and OpenBookQA, the reference document comes from the concatenation of the top sentences retrieved by Lucene McCandless et al. (2010) from the corresponding reference corpus, using the non-stop words in the question and answer option as the query Sun et al. (2018). The final prediction for each question is obtained by a linear plus softmax layer over the final hidden state of the first token of each input sequence. We refer readers to Devlin et al.; Sun et al. for more details.
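The input construction above can be sketched as follows. The paper elides the exact segment order and uses a WordPiece tokenizer, so the whitespace tokenization and the document-first ordering below are illustrative assumptions, not the authors' exact setup:

```python
def build_input_sequence(question, option, document, max_len=512):
    """Assemble a BERT-style input for one (question, option, document) triple.

    Whitespace-split sketch; a real implementation would use BERT's
    WordPiece tokenizer, and placing the document first is an assumption.
    """
    tokens = (["[CLS]"] + document.split() + ["[SEP]"]
              + question.split() + option.split() + ["[SEP]"])
    # Segmentation embedding A (id 0) for tokens before the first [SEP],
    # B (id 1) for the rest, mirroring the description above.
    first_sep = tokens.index("[SEP]")
    segment_ids = [0] * first_sep + [1] * (len(tokens) - first_sep)
    return tokens[:max_len], segment_ids[:max_len]
```

The classification head then reads the final hidden state at the [CLS] position of each (question, option) sequence and a softmax over the options yields the prediction.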

2.2 Utilization of In-Domain Data

Our basic framework consists of two stages: fine-tuning a pre-trained language model on a large-scale open-domain machine reading comprehension dataset (i.e., RACE) and then fine-tuning the resulting neural reader on the target question answering datasets. For the latter stage, instead of fine-tuning a neural reader on a single target dataset Sun et al. (2018), we also investigate fine-tuning a neural reader on multiple target datasets simultaneously.
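A minimal sketch of pooling training instances for simultaneous fine-tuning. The paper does not detail the mixing procedure, so the flat pooling and shuffling below (and the source-name tagging) are illustrative assumptions:

```python
import random

def merge_training_sets(datasets, seed=0):
    """Pool training instances from multiple target QA datasets and
    shuffle them for simultaneous fine-tuning.

    `datasets` maps a dataset name to its list of training examples.
    Tagging each example with its source name is a convenience for
    inspection, not something the paper specifies.
    """
    merged = [(name, ex) for name, data in datasets.items() for ex in data]
    random.Random(seed).shuffle(merged)  # deterministic shuffle for reproducibility
    return merged
```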

Task            Previous Single-Model SOTA   BERT   BERT + EDL   BERT + EDL + MD   Ensemble
ARC-Easy        66.6                         71.9   72.9         71.1              76.5
ARC-Challenge   40.7                         44.1   46.1         51.8              53.8
OpenBookQA      55.2                         64.8   67.0         68.0              69.6

Table 1: Accuracy (%) on the test sets of ARC and OpenBookQA. RACE is used as the source task of transfer learning for all tasks. MD stands for fine-tuning on multiple target datasets simultaneously (Section 2.2). Previous state-of-the-art (SOTA) results come from Sun et al. (2018). All results except the last column are single-model performance. See Appendix A for details about the ensemble methods.

2.3 Utilization of Open-Domain Data

We use entity discovery and linking (EDL) to help us enrich the reference documents.

Entity discovery is the task of extracting entity mentions from text. Most entity discovery systems are trained on pre-defined classes (e.g., Person, Location, and Organization). However, in ARC and OpenBookQA, the vast majority of entity mentions come from the scientific domain (e.g., “skin surface”, “oil”, “magnet”, and “iron”). As there is currently no potent entity discovery system for the scientific domain, we simply consider all noun phrases as entity mentions.
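Treating noun phrases as mentions can be sketched as below. The paper does not name its noun-phrase extractor, so the simple tag-pattern chunker here (maximal runs of adjectives and nouns over Penn Treebank tags) is an assumption for illustration:

```python
def noun_phrase_mentions(tagged_tokens):
    """Collect noun phrases as candidate entity mentions.

    `tagged_tokens` is a list of (word, POS) pairs with Penn Treebank
    tags. The chunking rule -- maximal runs of adjectives (JJ) and
    nouns (NN*) that contain at least one noun -- is a simplification.
    """
    mentions, current = [], []
    for word, pos in tagged_tokens:
        if pos == "JJ" or pos.startswith("NN"):
            current.append((word, pos))
            continue
        # Flush the current run if it actually contains a noun.
        if any(p.startswith("NN") for _, p in current):
            mentions.append(" ".join(w for w, _ in current))
        current = []
    if any(p.startswith("NN") for _, p in current):
        mentions.append(" ".join(w for w, _ in current))
    return mentions
```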

Entity linking can be divided into two sub-tasks: candidate generation and entity disambiguation. Given a set of extracted entity mentions, we first generate an initial list of candidate entities for each entity mention and then rank them, selecting the candidate entity with the highest score as the entity for linking.

A dictionary-based candidate generation approach Medelyan and Legg (2008) is adopted, which scores each candidate entity e for a mention m by

    p(e | m) = |A_{m,e}| / |A_m|

where A_m is the set of anchor links with the same anchor text as m, and A_{m,e} is the subset of A_m that points to entity e. Each initial list of candidate entities is then re-ranked based on three measures: salience, similarity, and coherence Pan et al. (2015).
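The dictionary-based prior can be sketched as follows, with a toy anchor-link list standing in for the dictionary harvested from Wikipedia:

```python
from collections import Counter

def generate_candidates(mention, anchor_links):
    """Dictionary-based candidate generation in the spirit of
    Medelyan and Legg (2008).

    `anchor_links` is a list of (anchor_text, target_entity) pairs;
    the data used here is illustrative. Returns candidates ranked by
    the prior p(e | m) = |A_{m,e}| / |A_m|.
    """
    targets = [e for text, e in anchor_links if text == mention]  # A_m
    total = len(targets)
    counts = Counter(targets)  # |A_{m,e}| per candidate entity e
    return sorted(((e, c / total) for e, c in counts.items()),
                  key=lambda item: -item[1])
```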

Salience is computed using Wikipedia anchor links:

    salience(e) = |A_e| / |A|

where A_e is the set of anchor links that point to entity e, and A is the set of all anchor links in Wikipedia.

Similarity refers to the context similarity between mention-entity pairs. We adopt a neural network model that jointly learns embeddings of words and entities from Wikipedia Yamada et al. (2017). For each entity mention, we build a vector representation of its context from the vector representations of the context words (excluding the entity mention itself and stop words). The cosine similarity between the vector representation of each candidate entity and this context vector measures the similarity between the mention and the entity.
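The context-similarity measure can be sketched as below. The toy embeddings stand in for the jointly learned word and entity vectors of Yamada et al. (2017), and averaging the context word vectors is an assumption about how the context representation is built:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def context_similarity(context_words, entity, word_vecs, entity_vecs):
    """Similarity between a mention's context and a candidate entity.

    Stop words and the mention itself are assumed to be filtered out
    of `context_words` already; `word_vecs`/`entity_vecs` stand in for
    jointly learned word and entity embeddings.
    """
    vecs = [word_vecs[w] for w in context_words if w in word_vecs]
    if not vecs:
        return 0.0
    dim = len(vecs[0])
    # Context vector = average of the context word embeddings (assumption).
    ctx = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
    return cosine(ctx, entity_vecs[entity])
```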

Coherence is driven by the assumption that if multiple mentions appear together within a sentence, their referent entities are more likely to be coherent in the knowledge base (KB). Following Huang et al. (2017), we construct a weighted undirected graph from the KB, whose nodes are all entities in the KB and whose links indicate that two entities share some KB properties. The weight of the link between entities e_i and e_j is computed as the Jaccard similarity

    w(e_i, e_j) = |P_i ∩ P_j| / |P_i ∪ P_j|

where P_i and P_j are the sets of KB properties of e_i and e_j, respectively. After constructing the knowledge graph, we apply the graph embedding framework of Tang et al. (2015) to generate knowledge representations for all entities in the KB. Coherence between two entities is modeled as the cosine similarity between their vector representations. Given an entity mention and one of its candidate entities, the coherence score is the average cosine similarity between that candidate and the candidate entities of the mention's coherent mentions (the other mentions in the same sentence). Finally, we combine these three measures to compute the final score for each entity candidate.
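The paper does not give the exact combination of the three measures, so the weighted sum below is only one plausible instantiation, with equal default weights as a placeholder:

```python
def rank_candidates(candidates, salience, similarity, coherence,
                    weights=(1.0, 1.0, 1.0)):
    """Combine salience, similarity, and coherence into a final score
    per candidate entity and return the best one.

    Each measure is a dict mapping candidate entity -> score in [0, 1].
    The weighted-sum combination and the equal weights are assumptions.
    """
    w_sal, w_sim, w_coh = weights
    scored = [(e, w_sal * salience[e] + w_sim * similarity[e]
                  + w_coh * coherence[e])
              for e in candidates]
    return max(scored, key=lambda item: item[1])
```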

We apply the EDL system described above to the text of all questions and answer options. For each discovered and linked entity, its Wikipedia abstract is extracted and appended to the corresponding reference document of each (question, answer option) pair.
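The enrichment step can be sketched as follows, assuming a mapping from linked entity titles to their Wikipedia abstracts is available:

```python
def enrich_reference(reference_doc, linked_entities, abstracts):
    """Append the Wikipedia abstract of each linked entity to the
    reference document of a (question, answer option) pair.

    `abstracts` maps entity titles to abstract text; entities without
    a known abstract are skipped.
    """
    extra = [abstracts[e] for e in linked_entities if e in abstracts]
    return " ".join([reference_doc] + extra)
```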

3 Experiments

3.1 Datasets

In our experiments, we use RACE Lai et al. (2017), the largest existing multiple-choice machine reading comprehension dataset, as the source task for transfer learning. We evaluate the performance of our methods on ARC Clark et al. (2016, 2018) (including ARC-Easy and ARC-Challenge) and OpenBookQA Mihaylov et al. (2018). All of these tasks are collected from examinations carefully designed by human experts, contain a significant number of questions that require external knowledge to answer, and still exhibit a big performance gap between humans and machines. We show the statistics of these datasets in Table 2.

Dataset Train Dev Test Total
ARC-Easy 2251 570 2376 5197
ARC-Challenge 1119 299 1172 2590
OpenBookQA 4957 500 500 5957
RACE 87866 4887 4934 97687
Table 2: Number of questions in the involved datasets. We use RACE as the source dataset in transfer learning.

3.2 Experimental Settings

We use the pre-trained uncased model released by Devlin et al. We set the batch size to , the learning rate to , and the maximum sequence length to . We fine-tune for epochs on RACE and on the other datasets. We show the accuracy of our implemented BERT baseline on the RACE dataset in Table 3.

Dataset Dev Test
RACE-M 76.7 76.6
RACE-H 71.0 70.1
RACE 72.7 72.0
Table 3: Accuracy (%) of the fine-tuned BERT baseline on RACE. RACE-M and RACE-H are two RACE subsets, representing questions collected from middle and high school language exams, respectively.

3.3 Experimental Results

As shown in Table 1, we see consistent improvements in accuracy across all tasks after we apply EDL to enrich the reference document for each question. For example, given the following question: “Which of the following statements best explains why magnets usually stick to a refrigerator door?” and its four answer options:

“The refrigerator door is smooth.”

“The refrigerator door contains iron.”

“The refrigerator door is a good conductor.”

“The refrigerator door has electric wires in it.”

by using EDL, we link the mention “magnets” to its corresponding Wikipedia entry Magnet and append its Wikipedia description “A magnet is a material or object that produces a magnetic field. This magnetic field is invisible but is responsible for the most notable property of a magnet: a force that pulls on other ferromagnetic materials, such as iron, and attracts or repels other magnets.” to its reference document.

Based on our preliminary experiments, we see further improvements on all the datasets except ARC-Easy, by fine-tuning the baseline model on the training instances of all the multiple-choice question answering datasets (i.e., ARC-Easy, ARC-Challenge, and OpenBookQA).

4 Related Work

4.1 Question Answering

Recent years have seen numerous datasets Richardson et al. (2013); Rajpurkar et al. (2016); Lai et al. (2017); Mihaylov et al. (2018); Clark et al. (2018); Choi et al. (2018); Reddy et al. (2018); Sun et al. (2019) and models Chen et al. (2016); Wang et al. (2018b); Radford et al. (2018); Devlin et al. (2018); Sun et al. (2018) to drive progress in question answering. On the dataset side, our work primarily focuses on multiple-choice examination datasets designed by educational experts Lai et al. (2017); Clark et al. (2018); Mihaylov et al. (2018); Sun et al. (2019) since questions from these datasets are generally clean, error-free, and challenging Sun et al. (2019). On the model side, our work follows the general framework of discriminatively fine-tuning pre-trained language models on question answering tasks Radford et al. (2018); Devlin et al. (2018); Sun et al. (2018).

4.2 Utilization of External Knowledge

Previous work has explored many ways to leverage external knowledge. Several studies Wang et al. (2018a); Sun et al. (2019) exploit ConceptNet Speer et al. (2017), a graph of general knowledge. Chen et al. propose to tackle open-domain question answering using Wikipedia. Ni et al. study improving the information retriever with essential terms Khashabi et al. (2017). In comparison, our work primarily focuses on improving multiple-choice question answering by leveraging external in-domain and external open-domain knowledge and, in particular, is the first work to leverage knowledge via EDL.

5 Conclusion

In this work, we study improving question answering by utilizing external in-domain question answering datasets and by utilizing external out-domain corpora to enrich the reference corpus. Preliminary experimental results on the ARC and OpenBookQA datasets demonstrate the effectiveness of our proposed approaches.


  • Chen et al. (2016) Danqi Chen, Jason Bolton, and Christopher D Manning. 2016. A thorough examination of the CNN/Daily Mail reading comprehension task. In Proceedings of the ACL, pages 2358–2367, Berlin, Germany.
  • Chen et al. (2017) Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Proceedings of the ACL, pages 1870–1879, Vancouver, Canada.
  • Choi et al. (2018) Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. 2018. QuAC: Question answering in context. Proceedings of the EMNLP.
  • Clark et al. (2018) Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. CoRR, cs.CL/1803.05457v1.
  • Clark et al. (2016) Peter Clark, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter D Turney, and Daniel Khashabi. 2016. Combining retrieval, statistics, and inference to answer elementary science questions. In Proceedings of the AAAI, pages 2580–2586, Phoenix, AZ.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, cs.CL/1810.04805v1.
  • Huang et al. (2017) Lifu Huang, Jonathan May, Xiaoman Pan, Heng Ji, Xiang Ren, Jiawei Han, Lin Zhao, and James A. Hendler. 2017. Liberal entity extraction: Rapid construction of fine-grained entity typing systems. Big Data, 5.
  • Khashabi et al. (2018) Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. 2018. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the NAACL-HLT, pages 252–262, New Orleans, LA.
  • Khashabi et al. (2017) Daniel Khashabi, Tushar Khot, Ashish Sabharwal, and Dan Roth. 2017. Learning what is essential in questions. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 80–89.
  • Lai et al. (2017) Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale reading comprehension dataset from examinations. In Proceedings of the EMNLP, pages 785–794, Copenhagen, Denmark.
  • Liu et al. (2018) Peter J Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. 2018. Generating Wikipedia by summarizing long sequences. In Proceedings of the ICLR, Vancouver, Canada.
  • McCandless et al. (2010) Michael McCandless, Erik Hatcher, and Otis Gospodnetic. 2010. Lucene in Action, Second Edition: Covers Apache Lucene 3.0. Manning Publications Co., Greenwich, CT.
  • McNamara et al. (2004) Danielle S McNamara, Irwin B Levinstein, and Chutima Boonthum. 2004. iSTART: Interactive strategy training for active reading and thinking. Behavior Research Methods, Instruments, & Computers, 36(2):222–233.
  • Medelyan and Legg (2008) O. Medelyan and C. Legg. 2008. Integrating Cyc and Wikipedia: Folksonomy meets rigorously defined common-sense. In Proceedings of the AAAI 2008 Workshop on Wikipedia and Artificial Intelligence, pages 13–18, Chicago, IL.
  • Mihaylov et al. (2018) Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? a new dataset for open book question answering. In Proceedings of the EMNLP, Brussels, Belgium.
  • Mostafazadeh et al. (2016) Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016. A corpus and evaluation framework for deeper understanding of commonsense stories. In Proceedings of the NAACL-HLT, pages 839–849, San Diego, CA.
  • Ni et al. (2018) Jianmo Ni, Chenguang Zhu, Weizhu Chen, and Julian McAuley. 2018. Learning to attend on essential terms: An enhanced retriever-reader model for open-domain question answering. CoRR, cs.CL/1808.09492v4.
  • Ostermann et al. (2018) Simon Ostermann, Michael Roth, Ashutosh Modi, Stefan Thater, and Manfred Pinkal. 2018. SemEval-2018 Task 11: Machine comprehension using commonsense knowledge. In Proceedings of the SemEval, pages 747–757, New Orleans, LA.
  • Pan et al. (2015) Xiaoman Pan, Taylor Cassidy, Ulf Hermjakob, Heng Ji, and Kevin Knight. 2015. Unsupervised entity linking with abstract meaning representation. In Proceedings of the NAACL-HLT, pages 1130–1139, Denver, CO.
  • Peters et al. (2018) Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the NAACL-HLT, pages 2227–2237, New Orleans, LA.
  • Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. In Preprint.
  • Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the EMNLP, pages 2383–2392, Austin, TX.
  • Reddy et al. (2018) Siva Reddy, Danqi Chen, and Christopher D Manning. 2018. CoQA: A conversational question answering challenge. CoRR, cs.CL/1808.07042v1.
  • Richardson et al. (2013) Matthew Richardson, Christopher JC Burges, and Erin Renshaw. 2013. MCTest: A challenge dataset for the open-domain machine comprehension of text. In Proceedings of the EMNLP, pages 193–203, Seattle, WA.
  • Salmerón et al. (2006) Ladislao Salmerón, Walter Kintsch, and José J Caãs. 2006. Reading strategies and prior knowledge in learning from hypertext. Memory & Cognition, 34(5):1157–1171.
  • Speer et al. (2017) Robert Speer, Joshua Chin, and Catherine Havasi. 2017. ConceptNet 5.5: An Open Multilingual Graph of General Knowledge. In Proceedings of the AAAI, pages 4444–4451, San Francisco, CA.
  • Sun et al. (2019) Kai Sun, Dian Yu, Jianshu Chen, Dong Yu, Yejin Choi, and Claire Cardie. 2019. DREAM: A challenge dataset and models for dialogue-based reading comprehension. Transactions of the Association of Computational Linguistics.
  • Sun et al. (2018) Kai Sun, Dian Yu, Dong Yu, and Claire Cardie. 2018. Improving machine reading comprehension with general reading strategies. CoRR, cs.CL/1810.13441v1.
  • Tang et al. (2015) Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. 2015. Line: Large-scale information network embedding. In Proceedings of the WWW, pages 1067–1077, Florence, Italy.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the NIPS, pages 5998–6008, Long Beach, CA.
  • Wang et al. (2018a) Liang Wang, Meng Sun, Wei Zhao, Kewei Shen, and Jingming Liu. 2018a. Yuanfudao at SemEval-2018 Task 11: Three-way attention and relational knowledge for commonsense machine comprehension. In Proceedings of the SemEval, pages 758–762, New Orleans, LA.
  • Wang et al. (2018b) Shuohang Wang, Mo Yu, Shiyu Chang, and Jing Jiang. 2018b. A co-matching model for multi-choice reading comprehension. In Proceedings of the ACL, pages 1–6, Melbourne, Australia.
  • Yamada et al. (2017) Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda, and Yoshiyasu Takefuji. 2017. Learning distributed representations of texts and entities from knowledge base. CoRR, abs/1705.02494.
  • Zhang and Seepho (2013) Lian Zhang and Sirinthorn Seepho. 2013. Metacognitive strategy use and academic reading achievement: Insights from a chinese context. Electronic Journal of Foreign Language Teaching, 10(1).
  • Zhu et al. (2015) Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE ICCV, pages 19–27, Santiago, Chile.

Appendix A Appendices

Base Model Input Sequence Finetuning Datasets Weight Count
R+C 1 7
R+C 1 4
R+C 1 4
R+C+E+O 3 2
R+C+E+O 3 2
R+C+E+O 3 3
R+C+E+O 3 3
R+C+E 3 1
R+C+E+O 3 1
R+C+E+O 3 1
Table 4: Settings of ARC-Challenge Models. : R (RACE), C (ARC-Challenge), E (ARC-Easy), O (OpenBookQA). : (start token in GPT), (delimiter token in GPT), (end token in GPT), ([CLS] token in BERT), ([SEP] token in BERT).
Input Sequence Finetuning Datasets Count
R+E 2
R+E 2
R+E 1
R+E 4
R+E 2
R+E+C+O 1
R+E+C+O 2
R+E+C+O 2
R+E+C+O 1
R+E+C+O 1
Table 5: Settings of ARC-Easy Models. : R (RACE), C (ARC-Challenge), E (ARC-Easy), O (OpenBookQA). : ([CLS] token in BERT), ([SEP] token in BERT).
Input Sequence Finetuning Datasets
Table 6: Settings of OpenBookQA Models. : R (RACE), C (ARC-Challenge), E (ARC-Easy), O (OpenBookQA), O* (OpenBookQA with 54.6% instances dropped). : ([CLS] token in BERT), ([SEP] token in BERT).
Input Sequence
Table 7: BERT Segmentation Embedding Settings for Different Input Sequences. We add segmentation embedding A to the underlined part and B to the rest.

a.1 Engineering Details of the Strong Systems Used for Comparison

In the middle of paper preparation, to make a competitive comparison, we put moderate engineering effort into building strong systems for ARC-Challenge, ARC-Easy, and OpenBookQA. These systems employed simultaneous fine-tuning on multiple target datasets (Section 2.2), system ensembles based on a generalization of the reading strategies of Sun et al. (2018), and different pre-trained language models Radford et al. (2018); Devlin et al. (2018). We describe their details in this section.

a.1.1 Approach Overview

  • Reference Documents Given a question and an option, we employed the same approach as Sun et al. to retrieve relevant sentences from the corpus provided by each dataset and regard the concatenation of the retrieved sentences as the reference document (Section 2.1). We did not leverage any further steps such as EDL (Section 2.3) to enrich the reference document.

  • Pre-trained Language Models We mainly employed BERT Devlin et al. (2018) as the pre-trained language model, using the uncased model for all our BERT-based models. Besides, we also employed GPT Radford et al. (2018) for ARC-Challenge.

  • Fine-Tuning Strategies Following Sun et al., all our models were first fine-tuned on the RACE dataset Lai et al. (2017). In our GPT-based model, we employed self-assessment (SA) and highlighting (HL) reading strategies Sun et al. (2018) and followed their input representation accordingly. In our BERT-based models, we generalized the back-and-forth reading strategy Sun et al. (2018) by training models with more diverse input sequence order and ensembling them simultaneously rather than only ensembling model pairs with reverse or almost reverse input sequence order.

  • Utilization of In-Domain Data We employed the approach of simultaneously fine-tuning on multiple target datasets described in Section 2.2 with the exception that we randomly dropped a portion of training instances in OpenBookQA when simultaneously fine-tuning on multiple target datasets for OpenBookQA.

a.1.2 Settings for Each Task

  • ARC-Challenge The system for ARC-Challenge was composed of 29 models (Table 4). The final prediction for each question is the option with the largest weighted average logit, where we simply set weight 1 for all models that only use RACE and ARC-Challenge for fine-tuning and weight 3 for the other models. The BERT segmentation embedding settings for different input sequences are detailed in Table 7.


  • ARC-Easy The system for ARC-Easy was composed of 18 models (Table 5). Different from ARC-Challenge, we only employed , and all models have equal weights (i.e., the final prediction for each question is the option with the largest average logit).

  • OpenBookQA The system for OpenBookQA was composed of 5 models (Table 6). Different from ARC, we employed only one model for each input sequence used. Moreover, we dropped 54.6% of the OpenBookQA training instances when fine-tuning on multiple datasets.

We trained our BERT-based models with the same settings as in Section 3.2 and our GPT-based model with the same settings as Sun et al.