Knowledge Guided Text Retrieval and Reading for Open Domain Question Answering

11/10/2019 ∙ by Sewon Min, et al.

This paper presents a general approach for open-domain question answering (QA) that models interactions between paragraphs using structural information from a knowledge base. We first describe how to construct a graph of passages from a large corpus, where the relations are either from the knowledge base or the internal structure of Wikipedia. We then introduce a reading comprehension model which takes this graph as an input, to better model relationships across pairs of paragraphs. This approach consistently outperforms competitive baselines in three open-domain QA datasets, WebQuestions, Natural Questions and TriviaQA, improving the pipeline-based state-of-the-art by 3–13%.


1 Introduction

Open-domain question answering systems aim to answer any question a user can pose, with evidence provided by either factual text such as Wikipedia (Chen et al., 2017; Yang et al., 2019) or knowledge bases (KBs) such as Freebase (Berant et al., 2013; Kwiatkowski et al., 2013; Yih et al., 2015). Textual evidence, in general, has better coverage, but KBs more directly support making complex inferences. It remains an open question how to best make use of KBs without sacrificing recall. Previous work has converted KB facts to sentences to provide extra evidence (Weissenborn et al., 2017; Mihaylov and Frank, 2018), but does not explicitly use the KB graph structure. In this paper, we show that such structure can be highly beneficial for fusing information across passages in open-domain text-based QA.

Figure 1: An example from Natural Questions. A graph of passages is constructed based on Wikipedia and Wikidata, where the edges are either cross-document or inner-document relations. While the model which reads each passage in parallel outputs the wrong answer (red), the model which synthesizes the context over passages predicts the correct answer (blue).

We introduce a general approach for text-based open-domain QA that models knowledge-rich interactions between passages using structural information from a knowledge base. Our goal is to combine the high coverage of textual corpora with the structural information and relationships apparent in knowledge bases, to improve both the recall and accuracy of the resulting model. Different from standard approaches where a model retrieves and reads a set of passages, we integrate graph structure at every stage to construct, retrieve and read a graph of passages.

Our approach first retrieves a graph using both text and KB, in which each node is a passage of text, and the relations are either from the knowledge graph or the internal structure of Wikipedia (Figure 1). Then, we introduce a reader model which takes this graph as an input, and considers relations between passage pairs in order to answer the question. Using the external knowledge base, our approach is able to synthesize the context over passages and build a knowledge-rich representation of text passages, which contributes to choosing the correct evidence context to derive the answer.

Our experiments demonstrate significant improvements on three popular open-domain QA datasets: WebQuestions (Berant et al., 2013), Natural Questions (Kwiatkowski et al., 2019) and TriviaQA (Joshi et al., 2017). Our novel graph-based QA model improves accuracy consistently, and significantly outperforms the previous pipeline-based state-of-the-art by 3–13%. Our ablations show that both the graph-based retriever and the graph-based reader substantially contribute to the performance improvements, even when the other component is fixed.

2 Related Work

Text-based Question Answering.

Text-based open-domain QA is a long studied problem Voorhees et al. (1999); Ferrucci et al. (2010). Recent work has focused on two stage approaches that combine information retrieval with neural reading comprehension Chen et al. (2017); Wang et al. (2018); Das et al. (2019); Yang et al. (2019). We follow this tradition but introduce a new framework which retrieves and reads a graph of passages. Similar methods for graph-based retrieval were also developed concurrently to our approach Godbole et al. (2019), and can be easily adapted to our framework. However, to the best of our knowledge, reading passages by incorporating relations between passages has not been previously studied.

After the retrieval step, the next challenge is to find the answer given a set of passages. Most work concatenates passages into a single sequence Swayamdipta et al. (2018); Yang et al. (2018); Song et al. (2018) or reads each in parallel Clark and Gardner (2018); Alberti et al. (2019); Min et al. (2019b); Wang et al. (2019). The most related work to ours includes Song et al. (2018) and Cao et al. (2019), which construct a graph of entities in the passage through entity detection and coreference resolution. In contrast, we focus on building a graph of passages, to better model the overall relationships between the passages.

Other lines of research in open-domain QA include joint learning of retrieval and reader components Lee et al. (2019) or direct phrase retrieval in a large collection of documents Seo et al. (2019). Although end-to-end training of our approach could further improve performance, this paper focuses only on pipeline approaches, since end-to-end training is computationally expensive and memory intensive.

Knowledge Base Question Answering.

Question answering over knowledge bases has also been studied Berant et al. (2013); Kwiatkowski et al. (2013); Yih et al. (2015), typically without using any external text collections. However, recent work has augmented knowledge bases with text from Wikipedia Das et al. (2017); Sun et al. (2018, 2019), to increase factual coverage. In this paper, we study what can be loosely seen as an inverse problem. The model answers questions based on a large set of documents, and the knowledge base is used to better model relationships between different passages of text.

3 Approach

We present a new general approach for text-based open-domain question answering, which consists of a knowledge graph based retrieval model GraphRetriever and a reading comprehension model GraphReader. GraphRetriever constructs a graph of passages for the question from a large collection of documents (Section 3.1) and GraphReader reads the input graph and finally returns an answer (Section 3.2).

3.1 Graph-Based Retrieval

Our retrieval approach leverages external knowledge to guide the selection of passages. GraphRetriever retrieves a graph of passages in which vertices are passages and edges denote relationships between passages. The passages are retrieved from documents that are related to entities in the question or have a high TF-IDF similarity score to the question. Retrieved passages are denoted by $p_1, \dots, p_N$, and relations are denoted by $r_{ij}$, where $r_{ij}$ indicates the type of relation between $p_i$ and $p_j$, or no relation. To achieve this, we use Wikipedia and WikiData as resources to retrieve passages, and combine two methods: entity-based retrieval and text-match retrieval.

Entity-based retrieval.

Our first method, entity-based retrieval, uses the graph structure of Wikidata to extract text passages related to the question. It first identifies entities in the question using an entity linking system and then expands these seed entities with entities connected to them in Wikidata. Our method retrieves the Wikipedia documents corresponding to the selected entities and connects those passages according to the relation type in the WikiData graph. Formally, a relation $r_{ij}$ between two passages $p_i$ and $p_j$ is a WikiData relation (e.g., performer and part of in Figure 1) if $p_i$ and $p_j$ are passages corresponding to entities that are connected in WikiData.
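To make the entity-based retrieval concrete, here is a minimal sketch. The entity linker (`link_entities`), the Wikidata neighbor lookup (`wikidata_neighbors`), and the passage lookup (`first_passage_of`) are hypothetical stand-ins for TAGME, a local Wikidata dump, and a Wikipedia index, not the authors' actual implementation.

```python
from collections import namedtuple

Passage = namedtuple("Passage", ["entity", "text", "score"])
Edge = namedtuple("Edge", ["src", "dst", "relation"])

def entity_based_retrieval(question, link_entities, wikidata_neighbors, first_passage_of):
    """Retrieve passages for question entities and their Wikidata neighbors.

    link_entities(question)    -> [(entity, linking_score), ...]     (hypothetical)
    wikidata_neighbors(entity) -> [(neighbor_entity, relation), ...] (hypothetical)
    first_passage_of(entity)   -> str or None                        (hypothetical)
    """
    passages, edges = {}, []
    for entity, score in link_entities(question):              # seed entities from the question
        if (text := first_passage_of(entity)) is not None:
            passages[entity] = Passage(entity, text, score)
        for neighbor, relation in wikidata_neighbors(entity):   # expand via Wikidata edges
            if neighbor not in passages and (text := first_passage_of(neighbor)) is not None:
                passages[neighbor] = Passage(neighbor, text, score)
            if entity in passages and neighbor in passages:
                edges.append(Edge(entity, neighbor, relation))  # cross-document relation
    return list(passages.values()), edges
```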

Text-match retrieval.

Our text-match retrieval method extends the TF-IDF-based retrieval from Chen et al. (2017): it selects the top articles from Wikipedia, splits each article into passages, and then runs BM25 (Robertson et al., 2009) over these passages to score them. We simply construct relations between passages that belong to the same Wikipedia article: $r_{ij}$ is child and $r_{ji}$ is parent if $p_i$ is the first passage of the article and $p_j$ is another passage from the same article.
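A minimal sketch of the text-match side, assuming a DrQA-style TF-IDF retriever has already returned candidate articles; `tfidf_top_articles`, `split_into_passages`, and `top_k` are placeholders, while `rank_bm25` is the library named in Appendix A.

```python
from rank_bm25 import BM25Okapi

def text_match_retrieval(question, tfidf_top_articles, split_into_passages, top_k=10):
    """Score passages of TF-IDF-retrieved articles with BM25 and keep the best ones.

    tfidf_top_articles(question) -> [article_text, ...]  (placeholder for DrQA's retriever)
    split_into_passages(article) -> [passage_text, ...]  (placeholder; ~300-token chunks)
    """
    passages = [p for article in tfidf_top_articles(question)
                for p in split_into_passages(article)]
    bm25 = BM25Okapi([p.lower().split() for p in passages])   # simple whitespace tokenization
    scores = bm25.get_scores(question.lower().split())
    ranked = sorted(zip(passages, scores), key=lambda x: -x[1])
    return ranked[:top_k]
```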

Retrieved passages from each retrieval method are sorted based on the entity linking scores of their originating seed entities (entity-based retrieval) or their BM25 scores (text-match retrieval). The highest scoring passages and their relations are used as the final graph. (Although we considered Personalized PageRank Haveliwala (2002) following Sun et al. (2018), our preliminary results show no improvement over this naive combination.)

3.2 Graph-Based Reading Comprehension

Our GraphReader takes a question $q$ and retrieved passages $p_1, \dots, p_N$ (and their relations $r_{ij}$) and aims to output an answer to the question given the list of retrieved passages. Instead of processing each passage independently, the key idea of our approach is to improve passage representations with fusion layers that integrate information across passages, taking the graph structure into account. The overall architecture is illustrated in Figure 2.

Figure 2: A diagram of GraphReader where the input graph is $p_1$, $p_2$, $p_3$ from Figure 1 and fusion layers aggregate information from other passages. Details are described in Section 3.2.

3.2.1 Initial Paragraph Representation

Formally, given the question $q$ and a paragraph $p_i$, GraphReader first obtains a question-aware passage representation:

$$P_i^{(0)} = \mathrm{Encode}(q, p_i) \in \mathbb{R}^{L \times h},$$

where $L$ is the maximum length of each passage, and $h$ is the hidden dimension. This representation can be obtained from strong pretrained models such as BERT (Devlin et al., 2019), although this is not necessary. Additionally, GraphReader encodes each relation $r_{ij}$ through a relation encoder:

$$\tilde{r}_{ij} = \mathrm{Encode}_{\mathrm{rel}}(r_{ij}) \in \mathbb{R}^{h}.$$

3.2.2 Fusing Paragraph Representations

GraphReader then builds graph-aware fusion layers to update passage representations by propagating information through edges of the graph. For each fusion layer $l$, GraphReader obtains a new passage representation $\tilde{P}_i^{(l)}$ for passage $p_i$, based on all the adjacent passages and their relations, through a composition function (described below). We investigate two variants of the fusion layers, one with a gating function and one without.

Simple fusion.

Simple fusion combines $\tilde{P}_i^{(l)}$ and the previous representation $P_i^{(l-1)}$ through a linear projection layer and finally obtains the updated representation $P_i^{(l)}$:

$$P_i^{(l)} = W^{(l)} \big[P_i^{(l-1)}; \tilde{P}_i^{(l)}\big] + b^{(l)},$$

where $[\cdot;\cdot]$ denotes concatenation, and $W^{(l)}$ and $b^{(l)}$ are both learnable parameters.

Gated fusion.

Different from simple fusion, the gated fusion uses $P_i^{(l-1)}$ and $\tilde{P}_i^{(l)}$ to compute a gating vector $g_i^{(l)}$ that controls how much information should be updated from the newly proposed $\tilde{P}_i^{(l)}$:

$$g_i^{(l)} = \sigma\big(W_g^{(l)} \big[P_i^{(l-1)}; \tilde{P}_i^{(l)}\big]\big), \qquad P_i^{(l)} = g_i^{(l)} \odot \tilde{P}_i^{(l)} + \big(1 - g_i^{(l)}\big) \odot P_i^{(l-1)},$$

where $\sigma$ is the sigmoid function, $\odot$ denotes element-wise multiplication, and $W_g^{(l)}$ is learnable.
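A minimal PyTorch sketch of the two fusion variants, following the reconstructed formulation above; the exact parameterization in the paper may differ, and the tensor shapes are assumptions.

```python
import torch
import torch.nn as nn

class FusionLayer(nn.Module):
    """One graph-aware fusion layer: combine a passage's previous representation
    with the proposed representation aggregated from its neighbors."""

    def __init__(self, dim, gated=True):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)   # simple fusion projection
        self.gate = nn.Linear(2 * dim, dim)   # gating vector (gated fusion only)
        self.gated = gated

    def forward(self, prev, proposed):
        # prev, proposed: (num_passages, passage_len, dim)
        combined = torch.cat([prev, proposed], dim=-1)
        if not self.gated:
            return self.proj(combined)              # simple fusion
        g = torch.sigmoid(self.gate(combined))      # gating vector g
        return g * proposed + (1.0 - g) * prev      # gated fusion
```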

We now describe several different ways to define the composition function, i.e., how to obtain $\tilde{P}_i^{(l)}$ from $P_j^{(l-1)}$ and $r_{ij}$.

Binary.

We first consider binary relations, i.e., whether a passage pair is related or not, without incorporating relation types. Specifically,

$$\tilde{P}_i^{(l)} = \sum_{j:\, r_{ij} \neq \text{no relation}} W_b^{(l)} P_j^{(l-1)} + b_b^{(l)},$$

where $W_b^{(l)}$ and $b_b^{(l)}$ are learnable parameters.

Relation-aware.

We then consider a more sophisticated, relation-aware composition function:

$$\tilde{P}_i^{(l)} = \sum_{j:\, r_{ij} \neq \text{no relation}} W_r^{(l)} \big(P_j^{(l-1)} \odot \tilde{r}_{ij}\big),$$

where $W_r^{(l)}$ and the relation representations $\tilde{r}_{ij}$ are learnable.
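A sketch of the binary and relation-aware composition functions that produce the proposed representation from neighboring passages; the learnable relation embeddings and the sum over neighbors are assumptions consistent with the prose above, not a verbatim reproduction of the paper's code.

```python
import torch
import torch.nn as nn

class Composition(nn.Module):
    """Aggregate neighbor passages into a proposed representation for passage i."""

    def __init__(self, dim, num_relations, relation_aware=True):
        super().__init__()
        self.linear = nn.Linear(dim, dim)
        self.rel_emb = nn.Embedding(num_relations, dim)   # one vector per relation type
        self.relation_aware = relation_aware

    def forward(self, neighbor_reprs, relation_ids):
        # neighbor_reprs: (num_neighbors, passage_len, dim); relation_ids: (num_neighbors,)
        if self.relation_aware:
            rel = self.rel_emb(relation_ids).unsqueeze(1)  # (num_neighbors, 1, dim)
            neighbor_reprs = neighbor_reprs * rel          # element-wise relation modulation
        # Binary variant: no modulation, only which neighbors are connected matters.
        return self.linear(neighbor_reprs).sum(dim=0)      # (passage_len, dim) proposal
```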

We compare alternative fusion methods in Section 5.1; the results indicate that the different ways of incorporating relation information do not yield significantly different performance, but this element-wise multiplication based method performs consistently well across all datasets.

3.2.3 Answering Questions

GraphReader uses the updated passage representations from the last fusion layer (denoted $P_i$) and computes the probability of each span $(s, e)$ in a passage being the answer as $p_{\text{start}}(s)\, p_{\text{end}}(e)$. Specifically,

$$p_{\text{start}}(s) = \mathrm{softmax}\big(P_i\, w_{\text{start}}\big)_s, \qquad p_{\text{end}}(e) = \mathrm{softmax}\big(P_i\, w_{\text{end}}\big)_e,$$

where $w_{\text{start}}, w_{\text{end}} \in \mathbb{R}^{h}$ are learnable vectors.
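A sketch of a standard extractive span head consistent with the description above; since the scoring function is not fully spelled out here, treat the exact form as an assumption.

```python
import torch
import torch.nn as nn

class SpanHead(nn.Module):
    """Score every token as a potential answer start/end inside one passage."""

    def __init__(self, dim):
        super().__init__()
        self.start = nn.Linear(dim, 1)
        self.end = nn.Linear(dim, 1)

    def forward(self, passage_repr):
        # passage_repr: (passage_len, dim), the final fused representation of one passage.
        start_logits = self.start(passage_repr).squeeze(-1)   # (passage_len,)
        end_logits = self.end(passage_repr).squeeze(-1)
        p_start = torch.softmax(start_logits, dim=0)
        p_end = torch.softmax(end_logits, dim=0)
        # Probability of span (s, e) is p_start[s] * p_end[e].
        return p_start, p_end
```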

4 Experiments

Dataset    | Statistics (Train / Dev / Test) | Graph density % (Cr. / In. / Tot.)
WebQ       | 3417 / 361 / 2032               | 0.9 / 1.9 / 2.8
NaturalQ   | 79168 / 8757 / 3610             | 0.7 / 2.2 / 2.9
TriviaQA   | 78785 / 8837 / 11313            | 0.8 / 2.0 / 2.8
Table 1: Statistics of the datasets and density of the graph (% of passage pairs with a relation) from GraphRetriever. Cr., In. and Tot. denote cross-document relations (WikiData relations), inner-document relations (child, parent), and their sum.
Retriever         | Reader                  | WebQuestions (Dev / Test) | Natural Questions (Dev / Test) | TriviaQA (Dev / Test)
Text-match        | ParReader               | 23.6 / 25.2               | 26.1 / 25.8                    | 52.1 / 52.1
Text-match        | ParReader++             | 19.9 / 20.8               | 28.9 / 28.7                    | 54.5 / 54.0
GraphRetriever    | ParReader               | 27.6 / 27.5               | 27.3 / 26.4                    | 52.1 / 52.4
GraphRetriever    | ParReader++             | 29.4 / 29.5               | 30.5 / 29.4                    | 54.5 / 53.9
GraphRetriever    | GraphReader (binary)    | 30.8 / 31.6               | 32.6 / 31.8                    | 54.9 / 54.1
GraphRetriever    | GraphReader (relation)  | 32.1 / 31.6               | 32.9 / 31.2                    | 55.7 / 55.4
SOTA (pipeline)   |                         |  -   / 18.5               | 28.8 / 28.1                    | 50.7 / 50.9
SOTA (end-to-end) |                         | 38.5 / 36.4               | 31.3 / 33.3                    | 45.1 / 45.0
Table 2: Overall results on the development and test sets of the three datasets. We also report state-of-the-art results, both pipeline and end-to-end: Lin et al. (2018), Min et al. (2019a), Lee et al. (2019). Note that the development sets used in Lee et al. (2019) are slightly different but the test sets are the same.

4.1 Datasets

We evaluate our model on three open-domain question answering datasets: (1) WebQuestions (Berant et al., 2013) was originally designed for answering questions over Freebase, with questions collected through the Google Suggest API; we use the same setting as Chen et al. (2017). (2) Natural Questions (Kwiatkowski et al., 2019) is a dataset of questions collected from the Google search engine, designed for end-to-end open-domain question answering. (3) TriviaQA (Joshi et al., 2017) is a dataset of questions from trivia and quiz-league websites. For all these datasets, we only use question and answer pairs for training and testing (and discard the documents provided in the original reading comprehension datasets). We follow the data split from Chen et al. (2017) for WebQuestions and from Min et al. (2019a) for Natural Questions and TriviaQA (https://bit.ly/2q8mshc and https://bit.ly/2HK1Fqn). We provide the statistics of the datasets and the density of the graphs retrieved by our GraphRetriever in Table 1.

4.2 Baselines

For retrieval, we compare our GraphRetriever to the pure text-match retriever described in Section 3.1, and investigate whether leveraging the knowledge graph actually improves the retrieval component.

We also compare several different reader models: (1) ParReader reads each passage in parallel and predicts a candidate span from the passage together with an answerable score (Min et al., 2019b; Wang et al., 2019); at test time, the passage with the highest answerable score is used for answer selection. (2) ParReader++ is an improved version of ParReader which still reads each passage in parallel but normalizes answerable scores across candidate passages. (This is a slight modification of Shared Normalization (Clark and Gardner, 2018); our preliminary results indicate this variant slightly outperforms the original Shared Normalization.) (3)-(4) GraphReader (Binary/Relation-aware) denote our main model, described in Section 3.2: "Binary" is only given whether or not a relation exists between a pair of passages, while relation types are used for the "Relation-aware" models.
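For intuition, here is a minimal sketch of shared normalization over answerable scores in the spirit of Clark and Gardner (2018); the exact variant used by ParReader++ may differ, and the loss form below is an assumption.

```python
import torch
import torch.nn.functional as F

def shared_normalization_loss(answerable_logits, answer_in_passage):
    """Normalize answerable scores jointly across all candidate passages of one question.

    answerable_logits: (num_passages,) tensor, one logit per retrieved passage.
    answer_in_passage: (num_passages,) bool tensor, True if the passage contains the answer.
    Assumes at least one passage contains the answer; the loss maximizes the total
    probability mass assigned to answer-bearing passages.
    """
    log_probs = F.log_softmax(answerable_logits, dim=0)   # joint normalization across passages
    positive = log_probs[answer_in_passage]
    return -torch.logsumexp(positive, dim=0)              # negative marginal log-likelihood
```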

4.3 Training Details

We summarize important training details here; full details can be found in Appendix A. For retrieval, we use the Wikipedia dump from 2018-12-20 (archive.org/download/enwiki-20181220) following Lee et al. (2019) and the Wikidata dump from 2019-06-01 (dumps.wikimedia.org/wikidatawiki/20190601). We split each article into passages at natural breaks and merge consecutive passages up to a maximum length of 300 tokens.
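A minimal sketch of this splitting step, assuming whitespace tokenization and paragraph breaks as the "natural breaks"; the real pipeline operates on WikiExtractor output and may differ in detail.

```python
def split_article(article_text, max_tokens=300):
    """Split an article at natural (paragraph) breaks, then greedily merge consecutive
    paragraphs until a passage would exceed max_tokens."""
    passages, current = [], []
    for paragraph in article_text.split("\n\n"):            # natural breaks
        tokens = paragraph.split()
        if current and len(current) + len(tokens) > max_tokens:
            passages.append(" ".join(current))
            current = []
        current.extend(tokens[:max_tokens])                  # clip overly long paragraphs
    if current:
        passages.append(" ".join(current))
    return passages
```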

When training GraphReader, we cannot feed too large a graph in the same batch due to memory constraints. Therefore, for every parameter update, we sample a subgraph with a fixed maximum number of passages, where at least one of them contains the answer text. In Section 5.1, we show ablations on different ways of sampling. For inference, we experiment with different numbers of passages and choose the best value on the development set for testing.
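A minimal sketch of this training-time subgraph sampling; keeping one answer-bearing passage, filling the rest at random, and keeping the induced edges is one plausible reading, not necessarily the authors' exact procedure.

```python
import random

def sample_subgraph(passages, edges, contains_answer, max_passages=20):
    """Sample up to max_passages passages, guaranteeing at least one contains the answer.

    passages: list of passage ids; edges: list of (i, j, relation) over those ids;
    contains_answer: set of passage ids whose text contains the answer string.
    """
    positives = [p for p in passages if p in contains_answer]
    if not positives:
        return None                                  # skip questions with no answer evidence
    keep = {random.choice(positives)}                # guarantee one answer-bearing passage
    rest = [p for p in passages if p not in keep]
    random.shuffle(rest)
    keep.update(rest[: max_passages - 1])
    sub_edges = [(i, j, r) for (i, j, r) in edges if i in keep and j in keep]
    return sorted(keep), sub_edges
```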

We use the uncased, base version of BERT (Devlin et al., 2019) for the question-aware passage representations. For the relation encoder, we keep the top relations on each dataset and train an embedding matrix over them. For each model, we experiment with different numbers of fusion layers and with both simple and gated fusion layers, and choose the configuration that gives the best result on the development set. More detailed hyperparameters are provided in Appendix A.

4.4 Main Results

The main results are given in Table 2. We observe the following:

  1. GraphRetriever offers substantial performance gains over text-match retrieval when we compare within the same reader, especially on WebQuestions and Natural Questions. This indicates that graph-based retrieval itself is beneficial for retrieving context passages.

  2. GraphReader offers performance improvements over two ParReader baselines, demonstrating that GraphReader with cross-passage reading of the text is more effective than reading each passage in parallel.

  3. Models using relation types offer consistent improvements over those with binary relations. However, the improvements are smaller than we expected, likely because the relation types are easily inferred based on text passages.

We also compare our results to the state-of-the-art, both pipeline and end-to-end approaches, in Table 2. In particular, our best performing model outperforms the previous pipeline-based state-of-the-art by 3–13%. Compared to the state-of-the-art end-to-end approaches, where the retriever and the reader are trained jointly, our model is behind on WebQuestions, comparable on Natural Questions, and better on TriviaQA. In particular, as observed in Lee et al. (2019), WebQuestions benefits greatly from end-to-end training. Although not explored in this paper, our framework can be trained end-to-end as well, which has great potential to further advance the state-of-the-art.

5 Analyses

In the following, we conduct ablation studies (Section 5.1) and analyze the effects of several design choices, including different relation types, composition functions and comparisons to concatenated-passage baselines. Finally, we present qualitative results in Section 5.2.

5.1 Ablation Studies

                  WebQuestions   Natural Q.
Fully connected   29.4           31.1
Empty             28.4           31.6
Cross-doc         30.2           31.4
Inner-doc         28.9           31.3
Cross+Inner       30.8           32.0
Table 3: Effects of different relation types. GraphRetriever and GraphReader with binary relations and a simple fusion layer are used; the reported numbers are from the development sets.
             WebQ   Natural Q.   TriviaQA
Binary       30.8   32.6         54.9
Elm-wise     32.1   32.9         55.7
Concat       30.8   32.3         55.7
Bilinear     29.4   32.3         55.8
Multi-task   30.8   31.8         53.9
Table 4: Comparisons of different composition functions. Note that GraphReader (Elm-wise) is the same as GraphReader (Relation-aware) in other tables. For all rows, GraphRetriever is used, and numbers on the development set are reported.
                     WebQ   Natural Q.
ParReader++          29.4   30.5
ParReader++ (pair)   28.4   27.8
GraphReader          32.1   32.4
Table 5: Comparison to passage concatenation. For all rows, GraphRetriever is used, and numbers on the development set are reported.
Question: What county is St. Louis Park in?
Answer: Hennepin County
Graph:
P1: [Saint Louis Park] Saint Louis Park is a city in Hennepin County, Minnesota, United States.
P2: [Hennepin County] Hennepin County is a county in the U.S. state of Minnesota.
P3: [St. Louis County] St. Louis County, Missouri is located in the far eastern portion of the U.S. state of Missouri.

Question: Which country did Nike originate from?
Answer: United States of America
Graph:
P1: [Nike, Inc.] Nike, Inc. is an American multinational corporation that is engaged in …
P2: [Nike, Inc.] In April 2014, one of the biggest strikes in mainland China took place at the Yue Yuen industrial holdings dongguan shoe factory, producing amongst others for Nike.
P3: [United States] United States of America is a country comprising 50 states, a federal district, and …

Question: Who sang more than a feeling by boston?
Answer: Brad Delp
Graph:
P1: [More Than a Feeling] "More Than a Feeling" is a song by the American rock band Boston. Written by Tom Scholz, …
P2: [More Than a Feeling] Personnel. Tom Scholz - acoustic and electric rhythm guitar, lead guitar, bass. Brad Delp - vocals.
P3: [Boston (album)] Boston is the debut studio album by American rock band Boston. Produced by Tom Scholz and John Boylan, … He subsequently started to concentrate on demos recorded in his apartment basement with singer Brad Delp.

Question: The sustainable development goals (SDGs) were adopted by the UN general assembly in what year?
Answer: 2015
Graph:
P1: [Sustainable Development Goals] The Sustainable Development Goals (SDGs) are a collection of 17 global goals set by the United Nations General Assembly in 2015 for the year 2030.
P2: [Sustainability] The Sustainable Development Goals replace the eight Millennium Development Goals (MDGs), which expired at the end of 2015. The MDGs were established in 2000 following the Millennium Summit of the United Nations.

Question: When is the world population expected to reach 8 billion?
Answer: 2024
Graph:
P1: [Projections of population growth] Projections of population growth established in 2017 predict that the human population is likely to keep growing until 2100, reaching an estimated 8.6 billion in 2030, 9.8 billion in 2050 and …
P2: [World population] According to current projections, the global population will reach eight billion by 2024, and …

Table 6: Examples from WebQuestions (top 2) and Natural Questions (bottom 3), where predictions from ParReader and GraphReader are denoted by red and blue text, respectively. For both models, GraphRetriever was used. Constructed graphs are reported; passage pairs with the no relation relation are omitted. The [bold] text at the start of each passage denotes the title of the Wikipedia article from which the passage originates.
Effect of different relation types.

Table 3 compares the effect of different relation types in the constructed graph of passages, showing results for the following settings: (a) fully connected, which ignores the relation types and connects all pairs of passages, (b) empty, which does not include any edge between passages, (c) cross-doc, which only includes edges between passages according to the Wikidata relation types, (d) inner-doc, which only includes edges between passages within a document, and (e) cross+inner, which includes both cross-doc and inner-doc edges, corresponding to the graph constructed by our approach. Note that (a) and (b) do not use any relation information from the input graph; they only consider the input passages.

The results show that including cross-doc and inner-doc relations consistently outperforms other variants. In particular, using relation information is better than ignoring relation information (fully connected, empty), demonstrating the importance of selecting a good set of graph edges. Finally, cross edges play a more important role compared to inner edges, showcasing the importance of incorporating external knowledge from Wikidata.

Effect of different composition functions for relations.

Table 4 demonstrates the effect of replacing the element-wise composition function for incorporating relations in GraphReader (explained in Section 3.2) with other composition functions: (a) Binary does not include any relation type, (b) Elm-wise uses the element-wise function described in Section 3.2, (c) Concat concatenates the passage representation and the relation representation, (d) Bilinear uses a bilinear function, and (e) Multi-task uses a multi-task objective of answering the question and classifying the relation given a passage pair, instead of incorporating relation representations. Details are described in Appendix A.

Table 4 shows that the different composition functions do not significantly impact performance, while the element-wise multiplication method (GraphReader (Elm-wise)) achieves the best or near-best performance across all datasets. Interestingly, we observe that the relation classification accuracy is 88-90% on WebQuestions and Natural Questions and 78% on TriviaQA, indicating that classifying the relation given a passage pair is relatively easy. (The classification accuracy is high only when it is given whether or not a relation exists; when this is not given, the classifier always predicts no relation.) This also explains why GraphReader (Binary), which only considers the presence of a relation, achieves strong results, sometimes outperforming the relation-aware models.

Comparison to passage concatenation.

Table 5 compares the performance of our graph-based method with a model that concatenates passages at the reading stage. We compare with ParReader++ (pair), which is similar to ParReader++ but is additionally given concatenated passage pairs that are related based on the input graph, along with their relation text. For this baseline, we split each passage into at most 140 tokens, since concatenated passages must be at most 300 tokens (the remaining 20 tokens are used for the relation text). Table 5 shows that concatenating passages is not competitive, potentially because truncating each passage causes significant information loss.

5.2 Qualitative Results

Table 6 shows a few examples from WebQuestions and Natural Questions. The common observation across different datasets is that our method incorporates knowledge-rich relationships between passages to find the correct evidence and answer questions. This is in contrast to the independent reading of passages in ParReader.

For example, the first question in Table 6 seems straightforward since P1 explicitly mentions the evidence. However, ParReader selects P3 as the evidence context, potentially because it assumes "St. Louis Park" and "St. Louis County" are related. In GraphReader, the connection between P1 and P2 strongly hints that "St. Louis Park" is not in "St. Louis County". For the second question, ParReader assigns a very high probability to "China" because "China" is the only country name explicitly mentioned in the article "Nike, Inc." However, the connection between P1 and P3 strongly hints that the answer should be "United States", which helps GraphReader predict the correct answer. For the third example, ParReader predicts "Tom Scholz" as an answer, potentially because of a bias that the first person mentioned in a passage about a song is likely to be the singer. However, GraphReader, which is aware of the connection between P1 and P3, predicts the correct answer. In the last two examples, independent reading of the passages in ParReader leads to the wrong selection of the evidence passage. However, our GraphReader reads P1 and P2 jointly and selects the correct evidence paragraph even when the two passages are not related based on an external knowledge base.

6 Conclusion

We proposed a general approach for open-domain question answering (QA) that models interactions between paragraphs using structural information from a knowledge base. Unlike standard approaches where a model retrieves and reads a set of passages, we integrate graph structure at every stage to construct, retrieve and read a graph of passages. Our approach consistently outperforms competitive baselines on three open-domain QA datasets, WebQuestions, Natural Questions and TriviaQA, and we also include a detailed qualitative analysis to illustrate where cross-paragraph reading contributes the most to the overall system performance.

References

  • Alberti et al. (2019) Chris Alberti, Kenton Lee, and Michael Collins. 2019. A BERT baseline for the Natural Questions. arXiv preprint arXiv:1901.08634.
  • Berant et al. (2013) Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In EMNLP.
  • Cao et al. (2019) Nicola De Cao, Wilker Aziz, and Ivan Titov. 2019. Question answering by reasoning across documents with graph convolutional networks. In NAACL.
  • Chen et al. (2017) Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In ACL.
  • Clark and Gardner (2018) Christopher Clark and Matt Gardner. 2018. Simple and effective multi-paragraph reading comprehension. In ACL.
  • Das et al. (2019) Rajarshi Das, Shehzaad Dhuliawala, Manzil Zaheer, and Andrew McCallum. 2019. Multi-step retriever-reader interaction for scalable open-domain question answering. In ICLR.
  • Das et al. (2017) Rajarshi Das, Manzil Zaheer, Siva Reddy, and Andrew McCallum. 2017. Question answering on knowledge bases and text using universal schema and memory networks.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.
  • Ferragina and Scaiella (2011) Paolo Ferragina and Ugo Scaiella. 2011. Fast and accurate annotation of short texts with Wikipedia pages. IEEE Software.
  • Ferrucci et al. (2010) David Ferrucci, Eric Brown, Jennifer Chu-Carroll, James Fan, David Gondek, Aditya A Kalyanpur, Adam Lally, J William Murdock, Eric Nyberg, John Prager, et al. 2010. Building Watson: An overview of the DeepQA project. AI magazine, 31(3):59–79.
  • Godbole et al. (2019) Ameya Godbole, Dilip Kavarthapu, Rajarshi Das, Zhiyu Gong, Abhishek Singhal, Hamed Zamani, Mo Yu, Tian Gao, Xiaoxiao Guo, Manzil Zaheer, et al. 2019. Multi-step entity-centric information retrieval for multi-hop question answering. In Workshop on Machine Reading for Question Answering EMNLP.
  • Haveliwala (2002) Taher H Haveliwala. 2002. Topic-sensitive pagerank. In Proceedings of the 11th international conference on World Wide Web, pages 517–526. ACM.
  • Joshi et al. (2017) Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In ACL.
  • Kwiatkowski et al. (2013) Tom Kwiatkowski, Eunsol Choi, Yoav Artzi, and Luke Zettlemoyer. 2013. Scaling semantic parsers with on-the-fly ontology matching. In EMNLP.
  • Kwiatkowski et al. (2019) Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Change, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural questions: a benchmark for question answering research. TACL.
  • Lee et al. (2019) Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. 2019. Latent retrieval for weakly supervised open domain question answering. In ACL.
  • Lin et al. (2018) Yankai Lin, Haozhe Ji, Zhiyuan Liu, and Maosong Sun. 2018. Denoising distantly supervised open-domain question answering. In ACL.
  • Mihaylov and Frank (2018) Todor Mihaylov and Anette Frank. 2018. Knowledgeable reader: Enhancing cloze-style reading comprehension with external commonsense knowledge. In ACL.
  • Min et al. (2019a) Sewon Min, Danqi Chen, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2019a. A discrete hard EM approach for weakly supervised question answering. In EMNLP.
  • Min et al. (2019b) Sewon Min, Eric Wallace, Sameer Singh, Matt Gardner, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2019b. Compositional questions do not necessitate multi-hop reasoning. In ACL.
  • Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch.
  • Robertson et al. (2009) Stephen Robertson, Hugo Zaragoza, et al. 2009. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends® in Information Retrieval.
  • Seo et al. (2019) Minjoon Seo, Jinhyuk Lee, Tom Kwiatkowski, Ankur P Parikh, Ali Farhadi, and Hannaneh Hajishirzi. 2019. Real-time open-domain question answering with dense-sparse phrase index. In ACL.
  • Song et al. (2018) Linfeng Song, Zhiguo Wang, Mo Yu, Yue Zhang, Radu Florian, and Daniel Gildea. 2018. Exploring graph-structured passage representation for multi-hop reading comprehension with graph neural networks. arXiv preprint arXiv:1809.02040.
  • Srivastava et al. (2014) Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research.
  • Sun et al. (2019) Haitian Sun, Tania Bedrax-Weiss, and William W Cohen. 2019. Pullnet: Open domain question answering with iterative retrieval on knowledge bases and text. In EMNLP.
  • Sun et al. (2018) Haitian Sun, Bhuwan Dhingra, Manzil Zaheer, Kathryn Mazaitis, Ruslan Salakhutdinov, and William W Cohen. 2018. Open domain question answering using early fusion of knowledge bases and text. In EMNLP.
  • Swayamdipta et al. (2018) Swabha Swayamdipta, Ankur P Parikh, and Tom Kwiatkowski. 2018. Multi-mention learning for reading comprehension with neural cascades. In ICLR.
  • Voorhees et al. (1999) Ellen M Voorhees et al. 1999. The TREC-8 question answering track report. In Trec.
  • Wang et al. (2018) Shuohang Wang, Mo Yu, Xiaoxiao Guo, Zhiguo Wang, Tim Klinger, Wei Zhang, Shiyu Chang, Gerry Tesauro, Bowen Zhou, and Jing Jiang. 2018. R³: Reinforced ranker-reader for open-domain question answering. In AAAI.
  • Wang et al. (2019) Zhiguo Wang, Patrick Ng, Xiaofei Ma, Ramesh Nallapati, and Bing Xiang. 2019. Multi-passage BERT: A globally normalized bert model for open-domain question answering. In ACL.
  • Weissenborn et al. (2017) Dirk Weissenborn, Tomáš Kočiskỳ, and Chris Dyer. 2017. Dynamic integration of background knowledge in neural nlu systems. arXiv preprint arXiv:1706.02596.
  • Wolf et al. (2019) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace’s Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.
  • Yang et al. (2019) Wei Yang, Yuqing Xie, Aileen Lin, Xingyu Li, Luchen Tan, Kun Xiong, Ming Li, and Jimmy Lin. 2019. End-to-end open-domain question answering with bertserini. In ACL.
  • Yang et al. (2018) Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In EMNLP.
  • Yih et al. (2015) Scott Wen-tau Yih, Ming-Wei Chang, Xiaodong He, and Jianfeng Gao. 2015. Semantic parsing via staged query graph generation: Question answering with knowledge base. In ACL.

Appendix A Training Details

Hyperparameter                                                    WebQuestions   Others
Batch size for ParReader                                          10             56
Batch size for ParReader++ and GraphReader                        8              16
# of sampled paragraphs for each question during training         20             20
Interval for evaluation on the development set during training    1k             2k
# of paragraphs on the development set during training            80             20
Table 7: Hyperparameters used for experiments. We use different values for WebQuestions because the dataset is much smaller than the others.

We use WikiExtractor (github.com/attardi/wikiextractor), TAGME Ferragina and Scaiella (2011) (github.com/gammaliu/tagme) and Rank-BM25 (github.com/dorianbrown/rank_bm25) for Wikipedia dump parsing, entity extraction and BM25, respectively.

For text-match retrieval, we use the top articles based on bi-gram TF-IDF scores from the Document Retriever in DrQA (Chen et al., 2017). Each article is split into paragraphs of up to 300 tokens, which are all fed into a BM25 ranker together. We only consider the first 20 hyperlinks for each article. We observe that a portion of the links are redirect pages and discard them.

All experiments are done in Python 3.5 and PyTorch 1.1.0 (Paszke et al., 2017). For BERT, we use pytorch-transformers (Wolf et al., 2019) (github.com/huggingface/pytorch-transformers). Specifically, given a question and a paragraph whose originating article has a title, we form a single input sequence from the question, the title and the paragraph text, joined with BERT's special separator token. This sequence is then fed into BERT, and the hidden representation of the sequence from the last layer is chosen as the question-aware paragraph representation. When BERT is used for relation representations, we follow a similar strategy: when question-aware representations are used, we form a sequence from the question and the relation text; otherwise, we use the relation text alone as the input sequence. When an embedding matrix is used, we only consider 75 relations, including the top 74 relations from the training set (which cover 93% of relations on the dev set) and UNK.
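A sketch of obtaining the question-aware passage representation with the HuggingFace library (shown here with the modern `transformers` API rather than the 2019 `pytorch-transformers` one); the exact input formatting with the article title is an assumption based on the description above.

```python
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

def encode_passage(question, title, passage, max_length=384):
    """Encode (question, title + passage) as one sequence and return per-token states."""
    inputs = tokenizer(
        question,
        title + " : " + passage,     # title prepended to the passage text (assumption)
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
    with torch.no_grad():
        outputs = model(**inputs)
    # (seq_len, hidden): the question-aware paragraph representation from the last layer.
    return outputs.last_hidden_state.squeeze(0)
```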

For each fusion layer, we apply dropout (Srivastava et al., 2014). For training, we evaluate the model on the development set periodically, and stop training when the Exact Match score does not improve for a fixed number of evaluations. Other hyperparameter values are reported in Table 7. For all other hyperparameters not mentioned, we follow the default settings from pytorch-transformers. At training time, the highest scoring passages with and without the answer text are used, mostly due to memory limitations.

Details for different composition functions

GraphReader (Concat) concatenates the paragraph representation and the relation representation: $\tilde{P}_i^{(l)} = \sum_{j:\, r_{ij} \neq \text{no relation}} W_c^{(l)} \big[P_j^{(l-1)}; \tilde{r}_{ij}\big]$, where $W_c^{(l)}$ and the relation representations $\tilde{r}_{ij}$ are learnable. GraphReader (Bilinear) uses a bilinear function: $\tilde{P}_i^{(l)} = \sum_{j:\, r_{ij} \neq \text{no relation}} \big(P_j^{(l-1)} W\big) \odot \tilde{r}_{ij}$, where $W$ is a learnable matrix. GraphReader (Multi-task) uses a multi-task objective of answering the question and classifying the relation given a paragraph pair, instead of incorporating relation representations; $\tilde{P}_i^{(l)}$ is computed in the same way as in GraphReader (Binary). The relation classifier is trained as $\hat{r}_{ij} = \mathrm{softmax}\big(W_{cls} \big[\bar{P}_i; \bar{P}_j\big]\big) \in \mathbb{R}^{R}$, where $\bar{P}$ denotes a pooled paragraph representation, $R$ is the number of unique relations, and $W_{cls}$ is learnable. We train with a cross-entropy objective using $\hat{r}_{ij}$ and the ground-truth relation, only when the ground-truth relation is not no relation.