Knowledge Based Machine Reading Comprehension

09/12/2018 ∙ by Yibo Sun, et al. ∙ Beihang University Microsoft Harbin Institute of Technology 0

Machine reading comprehension (MRC) requires reasoning about both the knowledge involved in a document and knowledge about the world. However, existing datasets are typically dominated by questions that can be solved well by context matching, and these fail to test this capability. To encourage progress on knowledge-based reasoning in MRC, we present knowledge-based MRC in this paper and build a new dataset consisting of 40,047 question-answer pairs. The annotation of this dataset is designed so that successfully answering the questions requires understanding the knowledge involved in a document. We implement a framework consisting of both a question answering model and a question generation model, both of which take the knowledge extracted from the document as well as relevant facts from an external knowledge base such as Freebase/ProBase/Reverb/NELL. Results show that incorporating side information from an external KB improves the accuracy of the baseline question answering system. We compare it with a standard MRC model, BiDAF, and also analyze the difficulty of the dataset and lay out remaining challenges.





Introduction

In recent years, there has been increasing interest in machine reading comprehension (MRC), which plays a vital role in assessing how well a machine can understand natural language. Several datasets [Rajpurkar et al.2016, Onishi et al.2016, Hill et al.2015] for machine reading comprehension have been released in recent years and have driven the evolution of powerful neural models. However, much of the research up to now has been dominated by answering questions that can be solved well using superficial information, and the resulting models struggle with accurate natural language understanding and reasoning. For example, Jia and Liang (2017) show that existing machine learning systems for MRC perform poorly under adversarial evaluation. Recent developments in MRC datasets [Kocisky et al.2018, Rajpurkar, Jia, and Liang2018, Welbl, Stenetorp, and Riedel2018] have heightened the need for deep understanding.

Knowledge has a pivotal role in accurately understanding and reasoning about natural language in MRC. Previous research [Hirsch2003, Carrell1983] has established that human reading comprehension requires both words and world knowledge. In this paper, we consider words and world knowledge in the format of triplets (subject, predicate, object). Specifically, we believe the advantages of using knowledge in MRC are three-fold. First, utilizing knowledge in MRC supports reasoning over multiple triplets because a single triplet may not cover the entire question. Multi-hop reasoning is also a long-standing goal in question answering. Second, building a question answering system based on triplet-style knowledge facilitates the interpretability of the decision making process. Triplets organize the document together with KBs as a graph, on which a well-designed model such as PCNet, which we describe in a later section, can explicitly reveal the rationale for its predictions. Third, representing documents as knowledge makes it easy to access and leverage external/background knowledge, because the knowledge representation of a document is consistent with both manually curated and automatically extracted KBs.

In this paper, we present knowledge based machine reading comprehension, which requires reasoning over triplet-style knowledge involved in a document. However, we find that published datasets do not sufficiently support this task. We conduct a preliminary exploration on SQuAD [Rajpurkar et al.2016]. We use a strong open IE algorithm [Del Corro and Gemulla2013] to extract triplets from the documents and observe that only 15% of instances have an answer that is exactly the same as the corresponding subject/object in the extracted triplets. To support knowledge-based MRC, we build a new dataset consisting of 40,047 examples. The annotation of this dataset is designed so that successfully answering the questions requires understanding the knowledge involved in a document. Each instance is composed of a question, a set of triplets derived from a document, and the answer.

We implement a framework consisting of both a question answering model and a question generation model, both of which take the knowledge extracted from the document as well as relevant facts from an external knowledge base such as Freebase/ProBase/Reverb/NELL. The question answering model gives each candidate answer a score by measuring the semantic relevance between the question representation and the candidate answer representation in vector space. The question generation model provides each candidate answer with a score by measuring the semantic relevance between the original question and the question generated from the semantics of the candidate answer. We implement an MRC model, BiDAF [Seo et al.2017], as a baseline for the proposed dataset. To test the scalability of our approach in leveraging external KBs, we use both manually created and automatically extracted KBs, including Freebase [Bollacker et al.2008], ProBase [Wu et al.2012], NELL [Carlson et al.2010] and Reverb [Fader, Soderland, and Etzioni2011]. Experiments show that incorporating evidence from external KBs improves both the matching-based and question generation-based approaches. Qualitative analysis shows the advantages and limitations of our approaches, as well as the remaining challenges.

Task Definition and Dataset

We formulate the task of knowledge based machine reading comprehension, which is abbreviated as KBMRC, and describe the dataset built for KBMRC and the external open KBs leveraged in this work.

Task Definition

The input of KBMRC includes a natural language question $q$, a knowledge base $KB_d$ derived from the document $d$, and potentially an external knowledge base $KB_o$. Both $KB_d$ and $KB_o$ consist of a set of triplets, in which a triplet is composed of a subject $subj$, a predicate $pred$, and one or more arguments $arg_1$, $arg_2$, etc. The output is a subject or an argument from $KB_d$ which correctly answers the question $q$. Figure 1 gives an example to illustrate the task.

Figure 1: An example that illustrates the task.
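As a rough illustration of the task input and output, an instance could be held in a small data structure like the one below. This is a minimal sketch, not the authors' data format: the class and field names (`Triplet`, `KBMRCInstance`, `candidates`) are hypothetical, and the sample fact reuses the question and triplet discussed for Figure 4.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Triplet:
    """One open-IE fact: a subject, a predicate, and one or more arguments."""
    subject: str
    predicate: str
    arguments: Tuple[str, ...]

    def answer_slots(self) -> List[str]:
        # The task output is a subject or an argument, never a predicate.
        return [self.subject, *self.arguments]


@dataclass
class KBMRCInstance:
    """A question, the document KB (a set of triplets), and the gold answer."""
    question: str
    document_kb: List[Triplet]
    answer: str

    def candidates(self) -> List[str]:
        # Collect every distinct subject/argument, preserving first-seen order.
        seen = set()
        ordered = []
        for triplet in self.document_kb:
            for slot in triplet.answer_slots():
                if slot not in seen:
                    seen.add(slot)
                    ordered.append(slot)
        return ordered


example = KBMRCInstance(
    question="Where is st johns mi located?",
    document_kb=[Triplet("st. johns", "is located", ("in clinton county",))],
    answer="in clinton county",
)
print(example.candidates())
```

Keeping predicates out of `candidates()` mirrors the task definition, in which only subjects and arguments can be answers.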

Dataset Construction

We build our dataset upon WebAssertions [Yan et al.2018], which is valuable to us because each sample in this dataset contains a document-question pair and triplets extracted from the document by an open IE algorithm [Del Corro and Gemulla2013]. Questions are collected from the search log of a search engine and documents are collected from a commercial search engine’s search results. However, WebAssertions does not annotate which subject/argument from a triplet is the answer. We therefore make further annotations to point out which subject/argument is the correct answer. To reduce annotation effort, we take every argument from a correct triplet as a candidate answer and provide each annotator with a question and a document. Annotators are asked to determine whether a subject/argument is a correct answer or not. We assign each candidate to three annotators and only keep the instances that are labeled as correct by at least two annotators.
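The agreement rule described above (three annotators per candidate, keep those marked correct by at least two) can be sketched as follows. The function name `keep_agreed` and the label layout are assumptions made for illustration, not the annotation tooling actually used.

```python
from typing import Dict, List


def keep_agreed(labels: Dict[str, List[bool]], min_votes: int = 2) -> List[str]:
    """Return candidates marked correct by at least `min_votes` annotators."""
    return [cand for cand, votes in labels.items() if sum(votes) >= min_votes]


labels = {
    "in clinton county": [True, True, False],  # 2 of 3 agree: kept
    "st. johns": [False, True, False],         # 1 of 3 agree: dropped
}
print(keep_agreed(labels))
```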

Statistics of the dataset are given in Table 1. The distribution of questions in our dataset is given in Table 2.

# of training instances 30,461
# of dev instances 4,481
# of test instances 5,105
Avg. triplets / document 5.85
Avg. arguments / document 14.27
Avg. predicates / document 5.85
Avg. words / argument 3.14
Avg. words / predicate 1.31
Avg. words / question 5.95
Table 1: Statistics of the dataset we create for KBMRC.

To measure the complexity of the dataset, we use a rule-based approach to detect the anchor and measure the coverage of correct answers through 1-hop and 2-hop paths. The details of how we get the anchor and the paths are described in Section Approach Overview. We observe that the coverage of 1-hop candidates is 55.6% and that of 2-hop candidates is 69.6%. This indicates that deep/multi-hop inference is required in this dataset.

Type   who     what    where   when    how     other
#      5,815   6,377   5,860   3,423   8,628   9,944
%      14.5%   15.9%   14.6%   8.5%    21.5%   24.8%
Table 2: Distribution of questions in our dataset.

We believe that one advantage of KBMRC is that the form of knowledge representation of a document is analogous to that of large-scale knowledge bases. To explore the scalability of our approach, we investigate four publicly available KBs which have been used for open question answering. Statistics of the open KBs are given in Fader, Zettlemoyer, and Etzioni (2014), from which we can see that these KBs include both manually curated ones (e.g. Freebase) and ones that are automatically extracted from web documents (e.g. Reverb, ProBase, NELL). We use each open KB individually in the experiments and report the results in the Experiments section.

Approach Overview

Our framework consists of a question answering model and a question generation model.

We implement two question answering models, which directly measure the semantic similarity between questions and candidate answers in the semantic space. First, to make the model’s prediction more explainable, we implement a path-based QA model, PCNet. In this model, to get candidate answers from the triplet-based document for a given question, we first retrieve an “anchor” point in the document facts. The anchor point is a subject/object mentioned in the question. These anchors are selected based on the edit distance between the words in the question and the arguments. Then we regard all arguments in the 1-hop and 2-hop facts of the anchors as candidate answers. However, the coverage of the candidate answers cannot reach 100% in this model. We then implement a second, end-to-end neural model, KVMemNet, which covers all the answers but with less interpretability. Both models generate a score for each candidate answer.

We then implement a generation-based model. The motivation for this model is that we want to associate natural language phrases with knowledge-based representations. It takes the semantics of a candidate answer $a$ as the input and generates a question $q'$. Then a paraphrasing model gives a score $S_{qg}(q, a)$, which is computed between the generated question $q'$ and the original question $q$, as the ranking score.

We get the final scores used for ranking as follows.

Figure 2: Approach overview.

Moreover, we incorporate side information from external KBs into the three models. The details of how we use external KBs to enhance the representation of elements in the document KB will be described in section Incorporating External Knowledge.

The Question Answering Model

In this section, we present two matching models to measure the semantic relatedness between the question and the candidate answer in the vector semantic space. Afterwards, we introduce our strategy that incorporates open KB as external knowledge to enhance both models.

QA Model 1: PCNet

We follow Bordes, Chopra, and Weston (2014) and Dong et al. (2015) and develop PCNet, which is short for path- and context-based neural network. In PCNet, candidate answers come from arguments of the document KB, and each candidate answer is represented with its neighboring arguments and predicates as well as its path from the anchor in the document KB. We use a rule-based approach based on fuzzy string matching to detect the anchor. Each argument is scored by $\sum_{i,j} \mathbb{I}(arg_i, q_j)$, where $i$ and $j$ iterate across argument words and question words, respectively. $\mathbb{I}(x, y)$ is an indicator function whose value is 1 if the minimum edit distance between $x$ and $y$ is no more than 1, otherwise it is 0. The arguments linked to the anchor with 1-hop and 2-hop paths are regarded as candidate answers. Since an argument might include multiple words, such as “the popular angry bird game”, we use a GRU-based RNN to get the vector representation of each argument/predicate. The path representation $v_p$ is computed by averaging the vectors of elements in the path. Similarly, we use another RNN to get the vector of each neighboring argument/predicate, and average them to get the context vector $v_c$.
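The anchor detection and candidate collection described above can be sketched in a few lines. This is a minimal sketch under stated assumptions: facts are simplified to (subject, predicate, object) triples, ties between equally scored arguments are broken by first occurrence, and all function names as well as the second and third toy facts are hypothetical.

```python
from typing import List, Set, Tuple

Fact = Tuple[str, str, str]  # (subject, predicate, object)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def anchor_score(argument: str, question: str) -> int:
    """Count (argument word, question word) pairs within edit distance 1."""
    return sum(edit_distance(w, v) <= 1
               for w in argument.lower().split()
               for v in question.lower().split())


def detect_anchor(facts: List[Fact], question: str) -> str:
    """Pick the subject/object with the highest fuzzy-match score."""
    arguments = [arg for subj, _, obj in facts for arg in (subj, obj)]
    return max(arguments, key=lambda arg: anchor_score(arg, question))


def hop_candidates(facts: List[Fact], anchor: str, hops: int = 2) -> Set[str]:
    """Arguments reachable from the anchor through at most `hops` facts."""
    reached = {anchor}
    frontier = {anchor}
    for _ in range(hops):
        nxt = set()
        for subj, _, obj in facts:
            if subj in frontier and obj not in reached:
                nxt.add(obj)
            if obj in frontier and subj not in reached:
                nxt.add(subj)
        reached |= nxt
        frontier = nxt
    return reached - {anchor}


facts = [
    ("st. johns", "is located", "in clinton county"),
    ("in clinton county", "is part of", "michigan"),  # hypothetical fact
    ("michigan", "borders", "ohio"),                  # hypothetical fact
]
anchor = detect_anchor(facts, "where is st johns mi located")
print(anchor, hop_candidates(facts, anchor, hops=2))
```

An answer three facts away from the anchor (here "ohio") never enters the candidate set, which is why this style of retrieval bounds the achievable accuracy.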

We represent the question using a bidirectional RNN layer based on the GRU unit. The concatenation of the last hidden vectors from both directions is used as the question vector $v_q$. The dot product is used to measure the semantic relevance between the question and the two types of evidence, namely the path and the context.


QA Model 2: KVMemNet

Despite the interpretability of PCNet, the coverage of anchor detection limits the upper bound of the approach. We implement a more powerful method based on the key-value memory network [Miller et al.2016], KVMemNet for short, which has proven powerful in KB-based question answering.

KVMemNet can be viewed as a “soft” matching approach, which includes a parameterized memory consisting of key-value pairs. Intuitively, keys are used for matching with the question, and values are used for matching with the candidate answer. Given a KB fact $(subj, pred, obj)$, we consider both directions and add two key-value pairs to the memory, namely $((subj, pred), obj)$ and $((obj, pred), subj)$. The vectors of arguments/predicates are calculated in the same way as described for PCNet. The concatenation of the two vectors is used as the key vector $k_i$.

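Each memory item is assigned a relevance probability $p_i$ by comparing the question vector $v_q$ to each key $k_i$.

$$p_i = \mathrm{softmax}_i\left(v_q^{\top} k_i\right)$$

Afterwards, the value vectors in memory ($v_i$) are summed, weighted by their addressing probabilities, and the resulting vector $o$ is returned.

$$o = \sum_i p_i\, v_i$$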


In PCNet, reasoning over two facts is achieved by incorporating 2-hop paths, while in KVMemNet this is achieved by repeating the memory access process twice. After receiving the result $o$, we update the query with $v_q \leftarrow M (v_q + o)$, where $M$ is a model parameter. Finally, after a fixed number of hops $H$ ($H = 2$ in this work), the resulting vector is used to measure the relevance to candidate answers via the dot product.
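The memory access loop can be sketched in plain Python, following the standard key-value memory network formulation of [Miller et al.2016]: softmax addressing over keys, a probability-weighted sum of values, and a linear query update. This is an illustrative sketch, not the authors' implementation. It assumes that queries, keys, values, and candidate vectors share one dimensionality, and it uses an identity matrix for the learned update matrix `M` in the toy run.

```python
import math
from typing import Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def memory_read(query: Vector, keys: List[Vector], values: List[Vector],
                M: List[Vector], hops: int = 2) -> Vector:
    """Repeat key addressing, value reading, and query update `hops` times."""
    q = list(query)
    for _ in range(hops):
        p = softmax([dot(q, k) for k in keys])             # addressing probabilities
        o = [sum(p_i * v[d] for p_i, v in zip(p, values))  # weighted sum of values
             for d in range(len(values[0]))]
        summed = [a + b for a, b in zip(q, o)]
        q = [dot(row, summed) for row in M]                # query update M (q + o)
    return q


def rank(query: Vector, keys: List[Vector], values: List[Vector],
         M: List[Vector], candidates: Dict[str, Vector], hops: int = 2) -> List[str]:
    """Order candidate answers by dot product with the final query vector."""
    q = memory_read(query, keys, values, M, hops)
    return sorted(candidates, key=lambda name: dot(q, candidates[name]), reverse=True)


identity = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[0.0, 1.0], [1.0, 0.0]]
print(memory_read([10.0, 0.0], keys, values, identity, hops=1))
```

In the toy run the query matches the first key almost exclusively, so the value paired with that key dominates the vector that is read back.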

Training and Inference

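Let $\mathcal{D} = \{(q, a)\}$ be the training data consisting of questions $q$ paired with their correct answers $a$. We train both matching models with a margin-based ranking loss function, which is calculated as follows, where $S(q, a)$ is the matching score, $m$ is the fixed margin, and $a'$ is randomly sampled from a set of incorrect candidates $\bar{\mathcal{A}}(q)$.

$$\mathcal{L} = \sum_{(q, a) \in \mathcal{D}} \max\big(0,\; m - S(q, a) + S(q, a')\big)$$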


For testing, given a question $q$, the model predicts the answer $\hat{a}$ based on the following equation, where $\mathcal{A}(q)$ is the candidate answer set.

$$\hat{a} = \arg\max_{a' \in \mathcal{A}(q)} S(q, a')$$
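The training objective and the inference rule can be sketched together. This is a minimal sketch of a standard margin-based ranking loss with one randomly sampled negative per example, followed by an argmax over candidate scores. The margin value used in the paper is not recoverable from this copy, so `margin` is left as a parameter. The function names and the toy scores are hypothetical.

```python
import random
from typing import Dict


def margin_ranking_loss(pos_score: float, neg_score: float, margin: float) -> float:
    """Hinge penalty when the correct answer does not beat the negative by `margin`."""
    return max(0.0, margin - pos_score + neg_score)


def example_loss(scores: Dict[str, float], gold: str, margin: float,
                 rng: random.Random) -> float:
    """Sample one incorrect candidate at random and compute the hinge loss."""
    negatives = [cand for cand in scores if cand != gold]
    sampled = rng.choice(negatives)
    return margin_ranking_loss(scores[gold], scores[sampled], margin)


def predict(scores: Dict[str, float]) -> str:
    """Inference: return the candidate with the highest matching score."""
    return max(scores, key=scores.get)


scores = {"in clinton county": 1.0, "st. johns": 1.2}  # toy matching scores
print(example_loss(scores, "in clinton county", margin=0.5, rng=random.Random(0)))
print(predict(scores))
```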


The Question Generation Model

In this section, we present the generation model which generates a question based on the semantics of a candidate answer. Afterward, we introduce how our paraphrasing model, which measures the semantic relevance between the generated question and the original question, is pretrained.

Hierarchical seq2seq Generation Model

Our question generation model takes the path between an “anchor” and the candidate answer as input, and outputs a question. We adopt the sequence-to-sequence architecture as our basic question generation model due to its effectiveness in natural language generation tasks. A hierarchical encoder and a hierarchical attention mechanism are introduced to take into account the structure of the input facts. Facts from external KBs are conveniently integrated into the model in the same way as described for the matching models.

Figure 3: The question generation model.

As illustrated in Figure 3, the question generation model contains an encoder and a decoder. We use a hierarchical encoder consisting of two layers to model the meaning of each element (subject, predicate or object) and the relationship between them. Since each element might contain multiple words, we use RNN as the word layer encoder to get the representation of each element. We define the fact level sequence as a path starting from the anchor and ending at the candidate answer. Another RNN is used as the fact level encoder. The last hidden state at the fact layer is fed to the decoder.

We develop a hierarchical attention mechanism in the decoder, which first makes a soft alignment over the hidden states at the fact layer, the output of which is further used to attend to hidden states at the word level. Specifically, given the decoder hidden state $h_t^{dec}$, we use a fact-level attention model to calculate the contextual vector $c_t^{F}$, which is further combined with the current hidden state $h_t^{dec}$, resulting in $\tilde{h}_t$. The contextual vector is calculated through weighted averaging over hidden vectors at the fact level, which is given as follows, where $\odot$ is the dot product, $h_i^{F}$ is the $i$-th hidden state at the fact layer from the encoder, and $N$ is the number of hidden states at the fact level.

$$c_t^{F} = \sum_{i=1}^{N} \alpha_{t,i}\, h_i^{F}, \qquad \alpha_{t,i} = \frac{\exp\left(h_t^{dec} \odot h_i^{F}\right)}{\sum_{j=1}^{N} \exp\left(h_t^{dec} \odot h_j^{F}\right)}$$

Similarly, we feed $\tilde{h}_t$ to the word-level attention function, which is calculated over the hidden vectors at the word level. Its output is concatenated with $\tilde{h}_t$ to predict the next word.

Since many entity names of great importance are rare words from the input, we use the copying mechanism [Gu et al.2016] that learns when to replicate words from the input or to predict words from the target vocabulary. The probability distribution of generating the word $y_t$ is calculated as follows, in which the softmax function is calculated over the combined logits from both sides.
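One way to read “a softmax over the combined logits from both sides” is the shared normalization used in copy-style decoders: generation logits over the vocabulary and copy logits over source positions are normalized together, and a word's probability is the sum of its generate and copy shares. The sketch below illustrates that reading only. The function name, argument layout, and toy logits are assumptions, and the exact formula used by the authors is not recoverable from this copy.

```python
import math
from collections import defaultdict
from typing import Dict, List


def copy_distribution(vocab_logits: Dict[str, float],
                      source_tokens: List[str],
                      copy_logits: List[float]) -> Dict[str, float]:
    """Normalize generation and copy logits jointly, then merge by word."""
    all_logits = list(vocab_logits.values()) + list(copy_logits)
    peak = max(all_logits)  # subtract the max for numerical stability
    total = sum(math.exp(l - peak) for l in all_logits)
    probs: Dict[str, float] = defaultdict(float)
    for word, logit in vocab_logits.items():
        probs[word] += math.exp(logit - peak) / total  # generate share
    for token, logit in zip(source_tokens, copy_logits):
        probs[token] += math.exp(logit - peak) / total  # copy share
    return dict(probs)


dist = copy_distribution(
    vocab_logits={"what": 0.0, "is": 0.0},
    source_tokens=["caltech", "is"],  # "caltech" is outside the toy vocabulary
    copy_logits=[0.0, 0.0],
)
print(dist)
```

A rare input word such as “caltech” receives probability mass only through the copy side, which is the motivation given above for using the mechanism.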


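We train our question generation model with maximum likelihood estimation. The loss function is given as follows, where $\mathcal{D}$ is the training corpus of input paths $x$ paired with questions $y$.

$$\mathcal{L} = -\sum_{(x, y) \in \mathcal{D}} \log p(y \mid x)$$

We use beam search in the inference process.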


An advantage of the model is that external knowledge can be easily incorporated with the same mechanism that we describe in section Incorporating External Knowledge. We enhance the representation of an argument or a predicate by concatenating open KB vectors to the encoder hidden states.

The Paraphrasing Model

The paraphrasing model is used to measure the semantic relevance between the original question and the question generated by the QG model. We use a bidirectional RNN with gated recurrent units to represent the two questions, and compose them with element-wise multiplication. The result is followed by a softmax layer, whose output length is 2. The model is trained by minimizing the cross-entropy error, in which the supervision is provided in the training data.

We collect two datasets to train the paraphrasing model. The first is the Quora dataset, which is built for detecting whether or not a pair of questions are semantically equivalent. 345,989 positive question pairs and 255,027 negative pairs are included in the first dataset. The second dataset includes web queries from query logs, which are obtained by clustering the web queries that click the same web page. In this way we obtain 6,118,023 positive query pairs. We implement a heuristic rule to get 6,118,023 negative instances for the query dataset. For each pair of queries, we fix the first query and retrieve a query that is most similar to the second query. To improve the efficiency of this process, we randomly sample 10,000 queries and define the “similarity” as the number of co-occurring words between two questions. During training, we initialize the word embeddings with GloVe vectors, which are learned on Wikipedia text. We use a held-out set of 20K query pairs to check the performance of the paraphrasing model. The accuracy of the paraphrasing model on the held-out set is 87.36%.
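The negative-mining heuristic for the query dataset can be sketched as below. This is a minimal sketch, assuming whitespace tokenization, a pool of raw query strings, and a pool that still contains at least one query after the pair itself is excluded. The function names and the toy queries are hypothetical.

```python
import random
from typing import List, Optional, Tuple


def word_overlap(a: str, b: str) -> int:
    """Similarity as the number of distinct words the two queries share."""
    return len(set(a.lower().split()) & set(b.lower().split()))


def mine_negative(pair: Tuple[str, str], pool: List[str],
                  sample_size: int = 10000,
                  rng: Optional[random.Random] = None) -> Tuple[str, str]:
    """Keep the first query fixed and pair it with a query that resembles the second."""
    first, second = pair
    rng = rng or random.Random(0)
    sample = rng.sample(pool, min(sample_size, len(pool)))  # random subsample for speed
    sample = [q for q in sample if q not in (first, second)]
    hardest = max(sample, key=lambda q: word_overlap(q, second))
    return first, hardest


pool = ["cheap flights to paris", "weather in paris today", "python list sort"]
print(mine_negative(("flights paris", "cheap flights to paris"), pool))
```

Pairing the fixed query with a lexically close but different query yields negatives that are harder than purely random ones.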

Incorporating External Knowledge

There are many possible ways to implement the idea of improving question answering with an external KB. In this work, we use external KBs (such as NELL and ProBase) to enhance the representations of elements in the document KB. For instance, the argument “the sequence of amino acids” in Figure 1 from the document KB retrieves (“amino acids”, “is”, “protein”) from NELL. Enhanced with this additional clue, the original argument is a better match to the question.

Similar to Khot, Sabharwal, and Clark (2017), we use ElasticSearch to retrieve facts from open KBs. We remove stop words from the tokens of each argument and predicate in a document KB and regard the remaining words as ElasticSearch queries. We set different search options for arguments and predicates, namely setting arguments as the dominant searchable fields for argument queries and setting predicates as the dominant searchable fields for predicate queries. We save the top 10 hits and selectively use them in the experiment.

We regard the retrieved facts from the external KB as neighbors of the arguments to be enhanced. Inspired by Scarselli et al. (2009), we update the vector of an element using its adjacent arguments from the facts retrieved by object and by subject. In this work, the update function is implemented by averaging the vectors of the two arguments.
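The enhancement step can be sketched as follows. This is a sketch under explicit assumptions: neighbor vectors are averaged as stated above, and the average is concatenated to the element's own vector, following the remark in the question generation section that open KB vectors are concatenated to hidden states. The zero-padding for elements with no retrieved neighbors and the function names are illustrative choices, not the authors' update rule.

```python
from typing import List

Vector = List[float]


def average(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def enhance(element: Vector, neighbours: List[Vector]) -> Vector:
    """Concatenate an element vector with the mean of its retrieved neighbours."""
    if not neighbours:
        # Assumption: pad with zeros so every enhanced vector has the same size.
        return element + [0.0] * len(element)
    return element + average(neighbours)


print(enhance([1.0, 2.0], [[0.0, 2.0], [2.0, 4.0]]))
```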

Figure 4: Model output of KVMemNet on the dev set. We show key-value pairs in the memory, and the probabilities of each hop used during inference. (s), (p) and (o) stand for subject, predicate, and object, respectively.

Related Work

The task of KBMRC differs from machine reading comprehension (MRC) in both input and output aspects. The input of KBMRC is the knowledge including both word knowledge extracted from the document and world knowledge retrieved from external knowledge base, while the input of MRC is the unstructured text of a document. The output of KBMRC is a subject or an argument, while the output in MRC is a text span of the document. Meanwhile, KBMRC facilitates the accessing and leveraging of knowledge from external KBs because the document KB is consistent with the representation of facts in external KBs.

KBMRC also relates to knowledge-base question answering (KBQA) [Yih et al.2015], which aims to answer questions based on an external large-scale KB such as Freebase or ProBase. KBMRC differs from KBQA in that the original KB comes from the content of a document. External KB is used in this work to enhance the document KB. Moreover, existing benchmark datasets for KBQA such as WebQuestions [Berant et al.2013] are typically limited to simple questions. The KBMRC task requires reasoning over two facts from the document KB.

Our approach draws inspiration from two main classes in existing approaches of KBQA, namely ranking based and parsing based. Ranking based approaches [Bordes, Chopra, and Weston2014, Berant and Liang2014] are bottom-up, which typically first find a set of candidate answers and then rank between the candidates with features at different levels to get the answer. Parsing-based approaches [Fader, Zettlemoyer, and Etzioni2014] are top-down, which first interpret logical form from a natural language utterance, and then do execution to yield the answer. Ranking-based approaches achieve better performances than parsing-based approaches on WebQuestions, a benchmark dataset for KBQA. We follow ranking-based approaches, and develop both a matching-based model with features at different levels and a question generation model. More references can be found at

Our work also relates to Khot, Sabharwal, and Clark (2017), who use open IE outputs from external text corpora to improve multi-choice question answering. However, our work differs from theirs in that their task does not contain document information. Furthermore, we develop a question generation approach while they regard the QA task as subgraph search based on an integer linear programming (ILP) approach. Our work also relates to Khashabi et al. (2018), who focus on multi-choice question answering based on the semantics of a document. They use semantic role labeling and shallow parsing of a document to construct a semantic graph, based on which an ILP-based approach is developed to find the supporting subgraph. Our approach differs in that predicates in our document KB are not limited to a predefined set, in that they do not take knowledge from external KBs into consideration, and in terms of methodology. Miller et al. (2016) answer questions based on KBs in the movie domain or information extraction results from Wikipedia documents. Unlike this method, our approach focuses on entities from an external KB, our document KB is obtained via open IE, and we combine the document KB with an open KB for question answering.


Experiments

We describe the experimental settings and report results and analysis in this section.


In our experiments, we tune model parameters on the development set and report results on the test set. We design experiments in both the ranking-based and the question generation-based directions. The evaluation metric is precision@1 (P@1) [Bordes, Chopra, and Weston2014], which indicates whether the top ranked result is the correct answer. We further report BLEU score [Papineni et al.2002] for the question generation approach.
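For concreteness, P@1 over a set of questions is the fraction whose top ranked candidate equals the gold answer. The helper below is a minimal sketch, with the function name and the exact-string-match comparison as assumptions.

```python
from typing import List


def precision_at_1(top_predictions: List[str], gold_answers: List[str]) -> float:
    """Fraction of questions whose top ranked candidate is the correct answer."""
    if len(top_predictions) != len(gold_answers):
        raise ValueError("predictions and gold answers must align one to one")
    if not gold_answers:
        return 0.0
    hits = sum(pred == gold for pred, gold in zip(top_predictions, gold_answers))
    return hits / len(gold_answers)


print(precision_at_1(["a", "b", "c", "d"], ["a", "x", "c", "d"]))
```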

We also adapt BiDAF [Seo et al.2017], a top-performing reading comprehension model on the SQuAD dataset [Rajpurkar et al.2016], as a strong baseline. As BiDAF outputs a span from the input document as the answer to the given question, we adapt it to KBMRC as a ranking model in a way similar to the approach used in previous research [Khashabi et al.2018]. We use BiDAF to select an answer span from the corresponding document based on a given question, and select the candidate answer that has maximum overlap with the answer span as the final answer.
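The span-to-candidate step of this baseline can be sketched as below. The overlap measure is not specified in the text, so this sketch assumes a lowercase, whitespace-token multiset overlap. The function names and the example span are hypothetical.

```python
from collections import Counter
from typing import List


def token_overlap(a: str, b: str) -> int:
    """Number of tokens shared by two strings, counting repeats."""
    shared = Counter(a.lower().split()) & Counter(b.lower().split())
    return sum(shared.values())


def select_candidate(predicted_span: str, candidates: List[str]) -> str:
    """Return the candidate answer that overlaps most with the predicted span."""
    return max(candidates, key=lambda cand: token_overlap(predicted_span, cand))


span = "located in clinton county michigan"  # hypothetical span from a reader model
print(select_candidate(span, ["st. johns", "in clinton county"]))
```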

Analysis: Question Answering Models

Table 3 shows the results of our two question answering models. It is clear that KVMemNet achieves better P@1 scores on both dev and test sets than PCNet. The reason is that candidate answers of PCNet come from the “anchor” point along 1-hop or 2-hop paths. However, the correct answer might not be connected to the anchor, depending on the quality of anchor detection. On the dev set, we observe that only 69.6% of correct answers are covered by the set of candidate answers in PCNet, which limits the upper bound of the approach. This is addressed in KVMemNet because all the arguments are candidate answers. Both PCNet and KVMemNet outperform our implementation of Bordes, Chopra, and Weston (2014), since the latter ignores word order. We incorporate each of the four KBs separately into PCNet and KVMemNet, and find that incorporating external KBs brings improvements.

Method Dev Test
[Bordes, Chopra, and Weston2014] 48.6 48.0
BiDAF 65.4 62.3
PCNet 50.7 49.4
  + NELL 54.0 52.9
  + Reverb 54.1 52.6
  + Probase 54.8 53.5
  + Freebase 54.7 53.1
KVMemNet 63.4 63.6
  + NELL 64.8 63.8
  + Reverb 64.4 64.3
  + Probase 64.0 63.7
  + Freebase 64.1 63.7
Table 3: Performances (P@1) of different approaches on dev and test sets.

From Figure 4, we can see that the KVMemNet model attends to the key “(st. johns, is located)” for the question “Where is st johns mi located?”. Thus, the model has higher confidence in regarding value “in clinton county” as the answer.

Analysis: Generative Models

Table 4 shows the results of different question generation models. Our approach is abbreviated as QGNet, which stands for the use of a paraphrasing model plus our question generation model. We can see that QGNet performs better than Seq2Seq in terms of BLEU score because many important low-frequency words from the input are replicated to the target sequence. However, the improvement is not significant for the QA task. We also incorporate each of the four KBs into QGNet, and observe slight improvements on NELL and Reverb. Although the overall accuracy of QGNet is lower than that of PCNet and KVMemNet, combining its outcomes with theirs yields 1.5% and 0.8% absolute gains, respectively.

Method BLEU P@1
Seq2Seq 11.2 41.3
QGNet 16.3 41.5
  + NELL 16.5 42.2
  + Reverb 16.2 42.1
  + Probase 16.5 41.8
  + Freebase 16.1 41.3
QGNet + PCNet 16.3 50.9
QGNet + KVMemNet 16.3 64.3
Table 4: Performances of different question generation based systems on the test set.

We show examples generated by our QG model in Figure 5, in which the paths of two candidate answers are regarded as the input to the QG model. We can see that the original question is closer to the first generated result than the second one. Accordingly, the first candidate ($ 61,300) would be assigned a larger probability as the answer.

Figure 5: Example prediction of QGNet on the dev set, in which the first candidate is the correct answer while the second is incorrect.

Opportunities for Future Research

We conduct error analysis from multiple perspectives to show the limitations and future opportunities of different components of our approach.

Question Answering Model.

We study the limitations of KVMemNet by analyzing incorrectly predicted results. We observe that the majority of errors are caused by a mismatch between the question and the key of the correct answer. One type of error occurs when the subject is a pronoun. For example, the question “What is the cost of caltech’s solar-powered toilet” hardly matches the fact “(it, will take, $1,500 to $2,000)”, in which the subject and predicate are the key and the object is the value. Incorporating coreference resolution information might mitigate this issue. Another type of error occurs when the key is a long sequence, in which the keyword is a part of it but not well matched to the question. Incorporating named entity recognition and leveraging it for question-key matching might be a potential direction for tackling this issue.

Question Generation Model.

We analyze a randomly selected set of questions generated by QGNet, and categorize unsatisfactory results into two groups. The first group consists of outputs with duplicate words, which might be solved by incorporating a coverage mechanism. The second group consists of outputs that replicate incomplete spans from the input. In this work, we do not design the model architecture to replicate an entire argument to the target sequence because a subject/object might be too long. However, a word-level copying mechanism does not guarantee that a text span consisting of multiple words (such as a named entity) will be copied in full.

External KB Retrieval.

We inspect randomly selected results retrieved from open KBs, and find that ElasticSearch performs well in most cases, balancing accuracy and latency. However, there still remain many cases in which named entities are only partly matched, so that the retrieved results talk about totally different things. The problem can be mitigated by taking named entity information into consideration in both search queries and searchable values in ElasticSearch. Even when the retrieved results are correct in terms of string match, some entities are ambiguous, and our current strategy is unable to distinguish between them. For instance, the retrieved results for “Louisiana” include “(LOUISIANA, is a company in the economic sector of, services)” and “(LOUISIANA, is a state or province located in the geopolitical location, U.S. )”, while only the second fact is the correct one. Incorporating a disambiguation system is a potential solution to this problem.


Conclusion

In this paper, we focus on knowledge based machine reading comprehension. We create a manually labeled dataset for the task, and develop a framework consisting of both a question answering model and a question generation model. We further incorporate four open KBs as external knowledge into both question answering and generative approaches, and demonstrate that incorporating additional evidence from open KBs improves overall accuracy. We conduct extensive model analysis and error analysis to show the advantages and limitations of our approaches.


  • [Berant and Liang2014] Berant, J., and Liang, P. 2014. Semantic parsing via paraphrasing. In ACL (1), 1415–1425.
  • [Berant et al.2013] Berant, J.; Chou, A.; Frostig, R.; and Liang, P. 2013. Semantic parsing on freebase from question-answer pairs. In EMNLP, volume 2,  6.
  • [Bollacker et al.2008] Bollacker, K.; Evans, C.; Paritosh, P.; Sturge, T.; and Taylor, J. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, 1247–1250. AcM.
  • [Bordes, Chopra, and Weston2014] Bordes, A.; Chopra, S.; and Weston, J. 2014. Question answering with subgraph embeddings. arXiv preprint arXiv:1406.3676.
  • [Carlson et al.2010] Carlson, A.; Betteridge, J.; Kisiel, B.; Settles, B.; Hruschka Jr, E. R.; and Mitchell, T. M. 2010. Toward an architecture for never-ending language learning. In AAAI, volume 5,  3.
  • [Carrell1983] Carrell, P. L. 1983. Three components of background knowledge in reading comprehension. Language learning 33(2):183–203.
  • [Del Corro and Gemulla2013] Del Corro, L., and Gemulla, R. 2013. Clausie: clause-based open information extraction. In Proceedings of the 22nd international conference on WWW, 355–366.
  • [Dong et al.2015] Dong, L.; Wei, F.; Zhou, M.; and Xu, K. 2015. Question answering over freebase with multi-column convolutional neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 260–269. Beijing, China: Association for Computational Linguistics.
  • [Fader, Soderland, and Etzioni2011] Fader, A.; Soderland, S.; and Etzioni, O. 2011. Identifying relations for open information extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 1535–1545. Association for Computational Linguistics.
  • [Fader, Zettlemoyer, and Etzioni2014] Fader, A.; Zettlemoyer, L.; and Etzioni, O. 2014. Open question answering over curated and extracted knowledge bases. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, 1156–1165. ACM.
  • [Gu et al.2016] Gu, J.; Lu, Z.; Li, H.; and Li, V. O. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1631–1640. Berlin, Germany: Association for Computational Linguistics.
  • [Hill et al.2015] Hill, F.; Bordes, A.; Chopra, S.; and Weston, J. 2015. The goldilocks principle: Reading children’s books with explicit memory representations. arXiv preprint arXiv:1511.02301.
  • [Hirsch2003] Hirsch, E. D. 2003. Reading comprehension requires knowledge of words and the world. American Educator 27(1):10–13.
  • [Jia and Liang2017] Jia, R., and Liang, P. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2021–2031. Association for Computational Linguistics.
  • [Khashabi et al.2018] Khashabi, D.; Khot, T.; Sabharwal, A.; and Roth, D. 2018. Question answering as global reasoning over semantic abstractions. In Conference of Association for the Advancement of Artificial Intelligence.
  • [Khashabi, Sabharwal, and Roth2018] Khashabi, D.; Khot, T.; Sabharwal, A.; and Roth, D. 2018. Question answering as global reasoning over semantic abstractions.
  • [Khot, Sabharwal, and Clark2017] Khot, T.; Sabharwal, A.; and Clark, P. 2017. Answering complex questions using open information extraction. arXiv preprint arXiv:1704.05572.
  • [Kocisky et al.2018] Kocisky, T.; Schwarz, J.; Blunsom, P.; Dyer, C.; Hermann, K. M.; Melis, G.; and Grefenstette, E. 2018. The narrativeqa reading comprehension challenge. Transactions of the Association for Computational Linguistics 6:317–328.
  • [Miller et al.2016] Miller, A.; Fisch, A.; Dodge, J.; Karimi, A.-H.; Bordes, A.; and Weston, J. 2016. Key-value memory networks for directly reading documents. arXiv preprint arXiv:1606.03126.
  • [Onishi et al.2016] Onishi, T.; Wang, H.; Bansal, M.; Gimpel, K.; and McAllester, D. 2016. Who did what: A large-scale person-centered cloze dataset. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2230–2235.
  • [Papineni et al.2002] Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, 311–318. Association for Computational Linguistics.
  • [Rajpurkar et al.2016] Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P. 2016. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.
  • [Rajpurkar, Jia, and Liang2018] Rajpurkar, P.; Jia, R.; and Liang, P. 2018. Know what you don’t know: Unanswerable questions for squad. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 784–789. Association for Computational Linguistics.
  • [Scarselli et al.2009] Scarselli, F.; Gori, M.; Tsoi, A. C.; Hagenbuchner, M.; and Monfardini, G. 2009. The graph neural network model. IEEE Transactions on Neural Networks 20(1):61–80.
  • [Seo et al.2017] Seo, M.; Kembhavi, A.; Farhadi, A.; and Hajishirzi, H. 2017. Bidirectional attention flow for machine comprehension. In Proc. ICLR’17.
  • [Welbl, Stenetorp, and Riedel2018] Welbl, J.; Stenetorp, P.; and Riedel, S. 2018. Constructing datasets for multi-hop reading comprehension across documents. Transactions of the Association for Computational Linguistics 6:287–302.
  • [Wu et al.2012] Wu, W.; Li, H.; Wang, H.; and Zhu, K. Q. 2012. Probase: A probabilistic taxonomy for text understanding. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, 481–492. ACM.
  • [Yan et al.2018] Yan, Z.; Tang, D.; Duan, N.; Liu, S.; Wang, W.; Jiang, D.; Zhou, M.; and Li, Z. 2018. Assertion-based qa with question-aware open information extraction. In AAAI.
  • [Yih et al.2015] Yih, W.-t.; Chang, M.-W.; He, X.; and Gao, J. 2015. Semantic parsing via staged query graph generation: Question answering with knowledge base. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, 1321–1331.