Evidence Sentence Extraction for Machine Reading Comprehension

02/23/2019 ∙ by Hai Wang, et al. ∙ University of Pennsylvania ∙ Tencent ∙ Cornell University ∙ Toyota Technological Institute at Chicago

Recently remarkable success has been achieved in machine reading comprehension (MRC). However, it is still difficult to interpret the predictions of existing MRC models. In this paper, we focus on: extracting evidence sentences that can explain/support answer predictions for multiple-choice MRC tasks, where the majority of answer options cannot be directly extracted from reference documents; studying the impacts of using the extracted sentences as the input of MRC models. Due to the lack of ground truth evidence sentence labels in most cases, we apply distant supervision to generate imperfect labels and then use them to train a neural evidence extractor. To denoise the noisy labels, we treat labels as latent variables and define priors over these latent variables by incorporating rich linguistic knowledge under a recently proposed deep probabilistic logic learning framework. We feed the extracted evidence sentences into existing MRC models and evaluate the end-to-end performance on three challenging multiple-choice MRC datasets: MultiRC, DREAM, and RACE, achieving comparable or better performance than the same models that take the full context as input. Our evidence extractor also outperforms a state-of-the-art sentence selector by a large margin on two open-domain question answering datasets: Quasar-T and SearchQA. To the best of our knowledge, this is the first work addressing evidence sentence extraction for multiple-choice MRC.




1 Introduction

Recently there has been increased interest in machine reading comprehension (MRC). We can roughly divide MRC tasks into two categories: 1) extractive/abstractive MRC such as SQuAD Rajpurkar et al. (2016), NarrativeQA Kočiskỳ et al. (2018), and CoQA Reddy et al. (2018); 2) multiple-choice MRC tasks such as MCTest Richardson et al. (2013), DREAM Sun et al. (2019) and RACE Lai et al. (2017). The MRC tasks in the first category primarily focus on locating text spans from the given reference document/corpus to answer informative factoid questions. In this work, we mainly focus on multiple-choice MRC: given a document and a question, the task aims to select the correct answer option(s) from a small number of answer options associated with this question.

Existing multiple-choice MRC models Wang et al. (2018b); Radford et al. (2018) take the whole reference document as input and seldom provide evidence snippets, making interpreting their predictions extremely difficult. It is a natural choice for human readers to use several sentences from the reference document to explain why they select a certain answer option in reading tests Bax (2013). In this paper, as a preliminary attempt, we focus on extracting evidence sentences that entail or support a question-answer pair from the reference document and investigating how well a neural reader can answer multiple-choice questions by just using extracted sentences as the input.

From the perspective of evidence sentence extraction, for extractive MRC tasks, information retrieval techniques can already serve as very strong baselines especially when questions provide sufficient information, and most questions are answerable from the content of a single sentence Lin et al. (2018); Min et al. (2018). For multiple-choice tasks, there are some unique challenges for evidence sentence extraction. The correct answer options of a significant number of questions (e.g., questions in RACE Lai et al. (2017); Sun et al. (2019)) are not extractive, which require advanced reading skills such as inference over multiple sentences and utilization of prior knowledge Lai et al. (2017); Khashabi et al. (2018); Ostermann et al. (2018). Besides, the existence of misleading distractors (i.e., wrong answer options) also dramatically increases the difficulty of extracting evidence sentences, especially when a question provides insufficient information. For example, in Figure 1, given the reference document and the question “Which of the following statements is true according to the passage?”, almost all the tokens in the wrong answer option B “In 1782, Harvard began to teach German.” appear in the document (i.e., sentence S and S). Furthermore, we notice that even humans sometimes have difficulty in finding pieces of evidence when the relationship between a question and its correct answer option is implicitly indicated in the document (e.g., “What is the main idea of this passage?”). Considering these challenges, we argue that extracting evidence sentences for multiple-choice MRC is at least as difficult as that for extractive MRC or factoid question answering.

Given a question, its associated answer options, and a reference document, we propose a method to extract sentences that can support or explain the (question, correct answer option) pair from the reference document. Due to the lack of ground truth evidence sentences in most multiple-choice MRC datasets, inspired by distant supervision, we first select silver standard evidence sentences based on the lexical features of a question and its correct answer option (Section 2.3.1), then we use these noisy labels to train an evidence sentence extraction model (Section 2.3.2). To denoise the distant supervision, we leverage rich linguistic knowledge from external resources such as ConceptNet Speer et al. (2017) and Paraphrase Database Pavlick et al. (2015), and we accommodate all those indirect supervision with a recently proposed deep probabilistic logic learning Wang and Poon (2018) framework (Section 2.2). We combine our evidence extractor with two recent neural readers Wang et al. (2018b); Radford et al. (2018) and evaluate the end-to-end performance on three challenging multiple-choice MRC datasets: MultiRC Khashabi et al. (2018), DREAM Sun et al. (2019), and RACE Lai et al. (2017). Experimental results show that we achieve comparable or better performance than baselines that consider the full context, indirectly demonstrating the quality of our extracted sentences. We also compare our evidence extractor with a recently proposed sentence selector Lin et al. (2018). Our extractor significantly outperforms the baseline selector in filtering out noisy retrieved paragraphs on two open-domain factoid question answering datasets: Quasar-T Dhingra et al. (2017a) and SearchQA Dunn et al. (2017).

Our primary contributions are as follows: 1) to the best of our knowledge, we present the first work to extract evidence sentences for multiple-choice MRC; 2) we utilize various sources of indirect supervision derived from linguistic knowledge to denoise the noisy evidence sentence labels and demonstrate the value of linguistic knowledge for MRC. We hope our attempts and observations can encourage the research community to develop more explainable models that simultaneously provide predictions and textual evidence.

Figure 1: An overview of our pipeline. The input instance comes from RACE Lai et al. (2017).

2 Method

Our pipeline contains a neural evidence extractor (Section 2.1) trained on the noisy training data generated by distant supervision (Section 2.2) and an existing neural reader (Section 4.1 and 4.2) for answer prediction that takes evidence sentences as input. We detail the entire pipeline in Section 2.3 and show an overview in Figure 1.

2.1 Transformer

We primarily use a multi-layer multi-head transformer Vaswani et al. (2017) to extract evidence sentences. Let $W_w$ and $W_p$ be the word (subword) and position embeddings, respectively. Let $M$ denote the total number of layers in the transformer. Then, the $m$-th layer hidden state $h^m$ of a token is given by:

$$h^m = \begin{cases} W_w + W_p & \text{if } m = 0 \\ \mathrm{TB}(h^{m-1}) & \text{if } 1 \le m \le M \end{cases}$$

where TB stands for the transformer block, which is a standard module that contains MLP, residual connections He et al. (2016) and LayerNorm Ba et al. (2016).

Recently, several pre-trained transformers such as GPT Radford et al. (2018) and BERT Devlin et al. (2018) have been released. Compared to RNNs such as LSTMs Hochreiter and Schmidhuber (1997) and GRUs Cho et al. (2014), pre-trained transformers capture rich world and linguistic knowledge from large-scale external corpora, and significant improvements are obtained by fine-tuning these pre-trained models on several downstream tasks. We follow this promising direction by fine-tuning GPT Radford et al. (2018). Note that the pre-trained transformer in our pipeline can also be easily replaced by BERT.

We use $(X, Y)$ to denote all training data and $(X_i, Y_i)$ to denote each instance, where $X_i$ is a token sequence, namely, $X_i = \{X_i^1, \ldots, X_i^t\}$, where $t$ equals the sequence length. For evidence extraction, $X_i$ contains one sentence in a document, a question, and all answer options associated with the question. $Y_i$ indicates the probability that the sentence in $X_i$ is selected as an evidence sentence for this question, and $\sum_{i=1}^{N} Y_i = 1$, where $N$ equals the total number of sentences in a document. The transformer takes $X_i$ as input and produces the final-layer hidden state $h_i^M$ of the last token in $X_i$ Radford et al. (2018), which is further fed into a linear layer followed by a softmax layer to generate the probability:

$$P_i = \frac{\exp(W_y h_i^M)}{\sum_{j=1}^{N} \exp(W_y h_j^M)}$$

where $W_y$ is the weight matrix for the output layer. Kullback-Leibler (KL) divergence loss is used as the training criterion.
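The output layer and loss described above can be illustrated with a small sketch. It assumes that each sentence has already been reduced to one scalar score (the linear layer applied to the last token's final hidden state), that the softmax runs over the sentences of one document, and that the loss is KL(target || predicted); the direction of the KL term and all numbers are assumptions made for illustration, not details confirmed by the text.

```python
import math

def softmax(scores):
    """Numerically stable softmax over per-sentence scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) for two distributions over the same sentences."""
    return sum(
        t * math.log((t + eps) / (p + eps))
        for t, p in zip(target, predicted)
        if t > 0
    )

# Hypothetical scores for a four-sentence document (one score per sentence).
scores = [2.0, 0.5, -1.0, 0.5]
predicted = softmax(scores)

# Hypothetical soft label: the first two sentences share the evidence mass.
target = [0.5, 0.5, 0.0, 0.0]
loss = kl_divergence(target, predicted)
```

Because the target is a distribution over sentences rather than a single hard label, a divergence-based loss lets several sentences share the evidence mass for one question.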

2.2 Deep Probabilistic Logic

Since human-labeled evidence sentences are seldom available in existing machine reading comprehension datasets, we use distant supervision to generate weakly labeled evidence sentences: since we know the correct answer options, we can select the sentences in the reference document that have the highest information overlap with the question and the correct answer option. However, weakly labeled data generated by distant supervision is inevitably noisy Bing et al. (2015), and therefore we need a denoising strategy that can leverage various sources of indirect supervision.

In this paper, we use Deep Probabilistic Logic (DPL) Wang and Poon (2018), a unifying denoising framework that can efficiently model various indirect supervision by integrating probabilistic logic with deep learning. It consists of two modules: 1) a supervision module that represents indirect supervision using probabilistic logic; 2) a prediction module that uses deep neural networks to perform the downstream task. The label decisions derived from indirect supervision are modeled as latent variables and serve as the interface between the two modules. DPL combines three sources of indirect supervision: data programming, distant supervision, and joint inference. For data programming, we introduce a set of labeling functions that are specified by simple rules and written by domain experts, and each function assigns a label to an instance if the input satisfies certain conditions. We will detail these sources of indirect supervision under our task setting in Section 2.3.

Formally, let $K$ be a set of indirect supervision signals, which is used to incorporate label preferences derived from prior knowledge. DPL comprises a supervision module $\Phi$ over $K$ and a prediction module $\Psi$ over $(X, Y)$, where $Y$ is latent in DPL:

$$P(K, Y \mid X) \propto \prod_{v} \Phi_v(X, Y) \cdot \prod_{i} \Psi(X_i, Y_i)$$

Without loss of generality, we assume all indirect supervision signals are log-linear factors, which can be compactly represented by weighted first-order logical formulas Richardson and Domingos (2006). Namely, $\Phi_v(X, Y) = \exp(w_v \cdot f_v(X, Y))$, where $f_v(X, Y)$ is a feature represented by a first-order logical formula, and $w_v$ is a weight parameter for $f_v(X, Y)$ that is initialized according to our prior belief about how strong this feature is. (Once the initial weights reasonably reflect our prior belief, the learning is stable.) The optimization of DPL amounts to maximizing $\sum_{Y} P(K, Y \mid X)$ (e.g., with a variational EM formulation), and we can use an EM-like learning approach to decompose the optimization over the supervision module and the prediction module. See Wang and Poon (2018) for more details about optimization.
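To make the interaction between log-linear factors and the prediction module concrete, the following is a minimal sketch for a single binary latent label. It assumes factors that each depend on only one label, so the posterior can be computed by normalizing over two cases; the function name `posterior_evidence_prob`, the two example factors, and their weights are hypothetical and are not taken from the DPL toolkit.

```python
import math

def posterior_evidence_prob(pred_prob, factors):
    """Combine the prediction module's belief with log-linear factors.

    pred_prob: probability from the prediction module that the label is 1.
    factors:   list of (weight, feature) pairs; feature(y) returns the
               feature value for label y (the input is closed over).
    Returns the normalized posterior probability that the label is 1.
    """
    unnormalized = {}
    for y in (0, 1):
        psi = pred_prob if y == 1 else 1.0 - pred_prob
        log_phi = sum(weight * feature(y) for weight, feature in factors)
        unnormalized[y] = psi * math.exp(log_phi)
    return unnormalized[1] / (unnormalized[0] + unnormalized[1])

# Hypothetical factors for one sentence.
def overlaps_answer(y):
    """Fires when a sentence that overlaps the answer is labeled evidence."""
    return 1.0 if y == 1 else 0.0

def too_short(y):
    """Fires when a very short sentence is labeled non-evidence."""
    return 1.0 if y == 0 else 0.0

q = posterior_evidence_prob(0.5, [(2.0, overlaps_answer), (0.5, too_short)])
```

In an EM-like loop, a posterior of this kind would serve as the soft target for the prediction module in the next round, while the factor weights are refit against the module's predictions.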

2.3 Our Pipeline

Figure 2: Deep probabilistic logic framework for evidence extraction. At test time, we only use trained neural evidence extractor for prediction.

As shown in Figure 2, in the training stage, our evidence extractor contains two components: a probabilistic graph containing various sources of indirect supervision used as a supervision module (Section 2.2) and a fine-tuned pre-trained transformer (Section 2.1) used as a prediction module. The two components are connected via a set of latent variables indicating whether each sentence is an evidence sentence or not. We update the model by alternately optimizing the transformer and the probabilistic graph so that they reach an agreement on latent variables. After training, only the transformer is kept to make predictions for a new instance during testing.

As we mentioned in Section 2.2, DPL can jointly represent three sources of different indirect supervision. We first introduce two distant supervision methods to generate noisy evidence sentence labels (Section 2.3.1). We then introduce other sources of indirect supervision — data programming and joint inference — used for denoising in DPL (Section 2.3.2).

2.3.1 Silver Standard Evidence Generation

Given correct answer options, we use two different distant supervision methods to generate the silver standard evidence sentences.

Rule-Based Method We select sentences that have higher weighted token overlap with a given (question, correct answer options) pair as silver standard evidence sentences. Tokens are weighted by the inverse term frequency.
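The rule-based selection can be sketched as follows. This is an illustration under stated assumptions: whitespace tokenization, a token weight equal to one over the token's count in the document, and a top-k cut-off; the function name `silver_evidence` and the toy passage are hypothetical.

```python
from collections import Counter

def silver_evidence(sentences, question, answer, k=2):
    """Rank sentences by weighted token overlap with question + answer."""
    tokenized = [s.lower().split() for s in sentences]
    # Document-level term frequency; rarer tokens get larger weights.
    term_freq = Counter(tok for sent in tokenized for tok in sent)
    query = set((question + " " + answer).lower().split())
    scored = []
    for idx, sent in enumerate(tokenized):
        shared = set(sent) & query
        scored.append((sum(1.0 / term_freq[tok] for tok in shared), idx))
    # Highest score first; ties are broken by document order.
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [idx for score, idx in ranked[:k] if score > 0]

sentences = [
    "tom bought a red bike",
    "he rode the bike to school on monday",
    "it rained all day",
]
selected = silver_evidence(sentences, "how did tom get to school", "by bike")
```

Here the second sentence ranks first because it shares three rare tokens with the query, while the third sentence shares none and is dropped.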

Integer Linear Programming (ILP) Inspired by ILP models for summarization Berg-Kirkpatrick et al. (2011); Boudin et al. (2015), we model evidence sentence selection as a maximum coverage problem and define the value of a selected sentence set as the sum of the weights for the unique words it contains. Formally, let $w_i$ denote the weight of word $i$, which takes one of three fixed values depending on whether word $i$ appears in the correct answer option, appears in the question but not in the correct answer option, or appears in neither. (We do not observe a significant improvement by tuning these parameters on the development set.)

We use binary variables $c_i$ and $s_j$ to indicate the presence of word $i$ and sentence $j$ in the selected sentence set, respectively. $\mathrm{Occ}_{ij}$ is a binary variable indicating the occurrence of word $i$ in sentence $j$, $l_j$ denotes the length of sentence $j$, and $K$ is the predefined maximum number of selected sentences. We formulate the problem as an ILP over these variables.
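As a rough illustration of the maximum coverage objective, the sketch below searches exhaustively over small sentence sets instead of calling an ILP solver. It assumes the standard form of the problem (maximize the total weight of covered words subject to a cap on the number of selected sentences) and does not use sentence lengths; the word weights and the function name `max_coverage_sentences` are hypothetical.

```python
from itertools import combinations

def max_coverage_sentences(sentence_words, word_weight, max_sentences):
    """Pick at most max_sentences sentences maximizing covered word weight.

    sentence_words: list of sets, the unique words of each sentence.
    word_weight:    dict mapping a word to its weight (missing words weigh 0).
    """
    best_value, best_set = 0.0, ()
    indices = range(len(sentence_words))
    for size in range(1, max_sentences + 1):
        for subset in combinations(indices, size):
            # Each unique word counts once, however many sentences contain it.
            covered = set().union(*(sentence_words[j] for j in subset))
            value = sum(word_weight.get(w, 0.0) for w in covered)
            if value > best_value:
                best_value, best_set = value, subset
    return best_set, best_value

sentence_words = [{"a", "b"}, {"b", "c"}, {"c", "d"}]
word_weight = {"a": 1.0, "b": 0.5, "c": 0.5, "d": 1.0}
chosen, value = max_coverage_sentences(sentence_words, word_weight, 2)
```

Exhaustive search is only practical for short documents and small caps; an ILP solver handles the same objective at scale.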


2.3.2 Denoising with DPL

Besides distant supervision, DPL also includes data programming and joint inference (i.e., the indirect supervision signals introduced in Section 2.2). As a preliminary attempt, we manually design a small number of sentence-level labeling functions for data programming and high-order factors for joint inference. We briefly introduce them as follows and list the implementation details in Appendix A.

For sentence-level functions, we consider lexical features (i.e., the sentence length, the entity types in a sentence, and sentence positions in a document), semantic features based on word and paraphrase embeddings and ConceptNet Speer et al. (2017) triples, and rewards for each sentence from an existing neural reader, language inference model, and sentiment classifier, respectively.

For high-order factors, we consider whether adjacent sentences prefer the same label, the maximum distance between two evidence sentences that support the same question, and the token overlap between two evidence sentences that support different questions.

Figure 3: A simple factor graph for denoising.

We show the factor graph for a toy example in Figure 3, where the document contains two sentences and two questions. $X_{ij}$ denotes an instance consisting of sentence $S_i$, question $Q_j$ and its associated options, and $Y_{ij}$ is a latent variable indicating the probability that sentence $S_i$ is an evidence sentence for question $Q_j$. We build a factor graph for the document and all its associated questions jointly. By introducing the logic rules jointly over $X_{ij}$ and $Y_{ij}$, we can model the joint probability for $Y$.
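A toy version of joint inference over such a graph can be written by enumerating all label assignments. The sketch assumes one question, a chain of sentences, per-sentence probabilities from the prediction module, and a single high-order factor that rewards adjacent sentences sharing a label; the weight and the function name `joint_marginals` are hypothetical, and real factor graphs would use message passing rather than enumeration.

```python
import math
from itertools import product

def joint_marginals(unary, same_label_weight):
    """Marginal probability that each sentence is evidence.

    unary:             per-sentence probabilities of being evidence.
    same_label_weight: weight of the factor that fires once for every
                       pair of adjacent sentences with equal labels.
    """
    n = len(unary)
    totals = [0.0] * n
    partition = 0.0
    for labels in product((0, 1), repeat=n):
        score = 1.0
        for p, y in zip(unary, labels):
            score *= p if y == 1 else 1.0 - p
        agreements = sum(1 for a, b in zip(labels, labels[1:]) if a == b)
        score *= math.exp(same_label_weight * agreements)
        partition += score
        for i, y in enumerate(labels):
            if y == 1:
                totals[i] += score
    return [t / partition for t in totals]

independent = joint_marginals([0.9, 0.5], 0.0)
coupled = joint_marginals([0.9, 0.5], 1.0)
```

With a zero weight the marginals equal the unary probabilities; with a positive weight the uncertain second sentence is pulled toward the label of its confident neighbor.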

3 Datasets

Dataset     # of documents            # of questions             Average # of sentences per document
            Train    Dev     Test     Train    Dev      Test     (Train + Dev + Test)
MultiRC     456      83      332      5,131    953      3,788    14.5 (Train + Dev)
DREAM       3,869    1,288   1,287    6,116    2,040    2,041    -
RACE        25,137   1,389   1,407    87,866   4,887    4,934    17.6
Quasar-T    -        -       -        37,012   3,000    3,000    100
SearchQA    -        -       -        99,811   13,893   27,247   50
Table 1: Statistics of multiple-choice machine reading comprehension and question answering datasets.

We primarily focus on extracting evidence sentences for multiple-choice machine reading comprehension. Three latest MRC datasets are investigated (Section 3.1). Additionally, to have a head-to-head comparison with existing sentence selectors designed for factoid question answering, we also evaluate our approach on two open-domain question answering datasets, in which answers are text spans (Section 3.2). See Table 1 for statistics.

3.1 Multiple-Choice Datasets

MultiRC Khashabi et al. (2018): MultiRC is a dataset in which questions can only be answered by considering information from multiple sentences. There can exist multiple correct answer options for a question. Reference documents come from seven different domains such as elementary school science and travel guides. For each document, questions and their associated answer options are generated and verified by turkers.

DREAM Sun et al. (2019): DREAM is a dataset collected from English listening exams for Chinese language learners. Each instance in DREAM contains a multi-turn multi-party dialogue, and the correct answer option must be inferred from the dialogue context. In particular, a large portion of questions require multi-sentence inference and/or commonsense knowledge.

RACE Lai et al. (2017): RACE is a dataset collected from English reading exams designed for middle (RACE-Middle) and high school (RACE-High) students in China, carefully designed by English instructors. The proportion of questions that requires reasoning is .

3.2 Question Answering Datasets

Quasar-T Dhingra et al. (2017a): It contains open-domain questions and their associated answers extracted from ClueWeb09. For each question, sentences are retrieved from ClueWeb09 using information retrieval techniques.

SearchQA Dunn et al. (2017): For each question, Dunn et al. (2017) retrieve web pages from J! Archive as the relevant documents using the Google Search API.

4 Experiments

4.1 Implementation Details

We use spaCy Honnibal and Johnson (2015) for tokenization and named entity tagging. We use the pre-trained transformer released by Radford et al. (2018) with the same pre-processing procedure. When the transformer is used as the neural reader, we set training epochs to 4, use eight P40 GPUs for experiments on RACE, and use one GPU for experiments on other datasets. When the transformer is used as the evidence extractor, we set the batch size to 1 per GPU and also set the dropout rate explicitly. We keep other parameters default. Depending on the dataset, training the evidence extractor generally takes several hours. Training neural readers with evidence sentences as input takes significantly less time than training them with full context as input.

For DPL, we adopt the toolkit from Wang and Poon (2018). We use Vader Gilbert (2014) for sentiment analysis and ParaNMT-50M Wieting and Gimpel (2018) to calculate the paraphrase similarity between two sentences. We use the triples in ConceptNet Speer and Havasi (2012); Speer et al. (2017) to incorporate commonsense knowledge. To calculate the natural language inference probability, we first fine-tune the transformer Radford et al. (2018) on several tasks, including SNLI Bowman et al. (2015), SciTail Khot et al. (2018), MultiNLI Williams et al. (2018), and QNLI Wang et al. (2018a).

To calculate the probability that each sentence leads to the correct answer option, we sample a subset of sentences and use them to replace the full context in each instance, and then we feed them into the transformer fine-tuned with instances with full context. If a particular combination of sentences leads to the prediction of the correct answer option, we reward each sentence inside this set with . To avoid the combinatorial explosion, we assume evidence sentences lie within window size . For another neural reader Co-Matching Wang et al. (2018b), we use its default parameters. For DREAM and RACE, we set , the maximum number of silver standard evidence sentences of a question, to . For MultiRC, we set to 5 since many questions have more than ground truth evidence sentences.
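The reward procedure can be sketched with a stand-in reader. The sketch assumes that candidate subsets are enumerated (rather than sampled) from sentences inside a sliding window, that the reader is any callable mapping a context string to an answer option, and that each sentence in a successful subset receives a fixed reward; the window size, the reward value, and the keyword-based `toy_reader` are all hypothetical.

```python
from itertools import combinations

def sentence_rewards(sentences, reader, correct_option, window=3, reward=1.0):
    """Reward sentences in subsets from which the reader answers correctly."""
    rewards = [0.0] * len(sentences)
    seen = set()
    for start in range(len(sentences)):
        in_window = range(start, min(start + window, len(sentences)))
        for size in range(1, len(in_window) + 1):
            for subset in combinations(in_window, size):
                if subset in seen:  # overlapping windows repeat subsets
                    continue
                seen.add(subset)
                context = " ".join(sentences[i] for i in subset)
                if reader(context) == correct_option:
                    for i in subset:
                        rewards[i] += reward
    return rewards

def toy_reader(context):
    """Stand-in for a fine-tuned neural reader (keyword rule only)."""
    return "B" if "bike" in context else "A"

story = ["tom woke up late", "he took his bike", "school was closed"]
rewards = sentence_rewards(story, toy_reader, "B", window=2)
```

Restricting subsets to a window keeps the number of reader calls linear in the document length for a fixed window size, at the cost of missing evidence sets that span distant sentences.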

During training, we conduct message passing in (Section 2.2) iteratively, which usually converges within iterations. For distant supervision (Section 2.3.1), we use the rule-based method to generate noisy labels for all the datasets except for RACE. On RACE, we use ILP-based method since we find the ILP-based method works better than the rule-based method on this dataset. The data programming and joint inference supervision on each dataset are slightly different. We will detail the differences in each subsection.

4.2 Results on Multiple-Choice Datasets

Approach F1_m F1_a EM_0
All-ones baseline Khashabi et al. (2018) 61.0 59.9 0.8
Lucene world baseline Khashabi et al. (2018) 61.8 59.2 1.4
Lucene paragraphs baseline Khashabi et al. (2018) 64.3 60.0 7.5
Logistic regression Khashabi et al. (2018) 66.5 63.2 11.8
Full context + Fine-Tuned Transformer (FT, Radford et al. (2018)) 68.7 66.7 11.0
Random 5 sentences + FT 65.3 63.1 7.2
Top 5 sentences by EER_silver + FT 70.2 68.6 12.7
Top 5 sentences by EER_DPL + FT 70.5 67.8 13.3
Top 5 sentences by EER_gt + FT 72.3 70.1 19.2
Ground truth evidence sentences + FT 78.1 74.0 28.6
Human Performance Khashabi et al. (2018) 86.4 83.8 56.6
Table 2: Performance of various settings on the MultiRC development set. We use the same fine-tuned transformer (FT) as the evidence extractor (EER) and the neural reader (EER_silver: EER trained on the silver standard evidence sentences; EER_DPL: EER trained with DPL as a supervision module; EER_gt: EER trained using ground truth evidence sentences; F1_m: macro-average F1; F1_a: micro-average F1; EM_0: exact match).

4.2.1 Evaluation on MultiRC

Since its test set is not publicly available, currently we only evaluate our model on the development set (Table 2). Figure 4 shows the precision-recall curves. The fine-tuned transformer (FT) baseline, which uses the full document as input, achieves an improvement of 2.2% in macro-average F1 over the previous highest score, 66.5%. If we train our evidence extractor using the ground truth evidence sentences provided by turkers, we can obtain a much higher macro-average F1 of 72.3%, even after we remove most of the sentences in each document (only the top 5 are kept, against an average of 14.5 sentences per document in Table 1). We can regard this result as the supervised upper bound for our evidence extractor. If we train the evidence extractor with DPL as a supervision module, we get 70.5% in macro-average F1. The performance gap between 70.5% and 72.3% shows there is still room for improving our denoising strategy.

Figure 4: Precision-recall curves for different settings on the MultiRC development set (IR: information retrieval baseline; LR: logistic regression baseline implemented by Khashabi et al. (2018)).

4.2.2 Evaluation on DREAM

See Table 3 for results on the DREAM dataset. We compare three settings: the fine-tuned transformer (FT) baseline, which uses the full document as input; the fine-tuned transformer fed with evidence sentences from our evidence extractor trained with DPL as a supervision module; and the fine-tuned transformer fed with evidence sentences from the extractor trained only with silver standard evidence sentences from the rule-based distant supervision method, which gives lower test accuracy than that with full supervision. Experiments demonstrate the effectiveness of our evidence extractor with the denoising strategy, and the usefulness of evidence sentences for dialogue-based machine reading comprehension.

Approach Dev Test
Full context + FT Sun et al. (2019) 55.9 55.5
Full context + FT 55.1 55.1
Top 3 sentences by  + FT 50.1 50.4
Top 3 sentences by 55.1 56.3
Top 3 sentences by 57.3 57.7
Silver standard evidence sentences + FT 60.5 59.8
Human Performance 93.9 95.5
Table 3: Performance in accuracy (%) on the DREAM dataset (results in the first row are taken from Sun et al. (2019); the evidence extractor (EER) variants include one trained using silver standard evidence sentences).

4.2.3 Evaluation on RACE

On RACE, as we cannot find any public implementations of recently published independent sentence selectors, we compare our evidence sentence extractor with InferSent released by Conneau et al. (2017), as previous work Htut et al. (2018) has shown that it outperforms many state-of-the-art sophisticated sentence selectors on a range of tasks. We also investigate the portability of our evidence extractor by combining it with two neural readers. Besides the fine-tuned transformer, we use Co-Matching Wang et al. (2018b), another state-of-the-art neural reader on RACE.

As shown in Table 4, by using the evidence sentences selected by InferSent, we suffer a drop in accuracy with both Co-Matching and the fine-tuned transformer. In comparison, by using the sentences extracted by our sentence extractor, which is trained with DPL as a supervision module, we observe a much smaller decrease in accuracy with the transformer baseline, and we slightly improve the accuracy with the Co-Matching baseline. For questions in RACE, introducing the content of answer options as additional information for evidence extraction can narrow the accuracy gap, which might be due to the fact that many questions are less informative Xu et al. (2018). Note that all these results are compared with the accuracy reported by Radford et al. (2018); if compared with our own replication, the sentence extractor trained with either DPL or distant supervision leads to a gain in accuracy.

Since the problems in RACE are designed for human examinees and require advanced reading comprehension skills such as the utilization of external world knowledge and in-depth reasoning, even human annotators sometimes have difficulties in locating evidence sentences (Section 4.2.4). Therefore, a limited number of evidence sentences might be insufficient for answering challenging questions. Instead of removing “non-relevant” sentences, we keep all the sentences in a document while adding a special token before and after extracted evidence sentences. With DPL as a supervision module, we see an improvement in test accuracy of 0.9% (from 58.9% to 59.8% in Table 4).
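Marking evidence inside the full context can be sketched in a few lines. The marker string `<ev>` and the function name `mark_evidence` are hypothetical; the text only states that a special token is added before and after each extracted evidence sentence.

```python
def mark_evidence(sentences, evidence_indices, marker="<ev>"):
    """Keep every sentence and wrap the extracted ones in a marker token."""
    chosen = set(evidence_indices)
    return " ".join(
        f"{marker} {sentence} {marker}" if idx in chosen else sentence
        for idx, sentence in enumerate(sentences)
    )

marked = mark_evidence(["a b.", "c d.", "e f."], [1])
```

In practice the marker would need to be added to the reader's vocabulary so that it is not split into subword pieces.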

For our current supervised upper bound (i.e., assuming we know the correct answer option, we find the silver evidence sentences from ILP-based distant supervision and then feed them into the fine-tuned transformer), we get 72.8% in test accuracy, which is quite close to the performance of Amazon Turkers (73.3%). However, it is still much lower than the ceiling performance. To answer questions that require external knowledge, it might be a promising direction to retrieve evidence sentences from external resources, compared to only considering sentences within a reference document.

Approach Dev Test
Middle High All Middle High All
Sliding Window Richardson et al. (2013); Lai et al. (2017) - - - 37.3 30.4 32.2
Co-Matching Wang et al. (2018b) - - - 55.8 48.2 50.4
Full context + FT Radford et al. (2018) - - - 62.9 57.4 59.0
Full context + FT 55.6 56.5 56.0 57.5 56.5 56.8
Random 3 sentences + FT 50.3 51.1 50.9 50.9 49.5 49.9
Top 3 sentences by InferSent (question) + Co-Matching 49.8 48.1 48.5 50.0 45.5 46.8
Top 3 sentences by InferSent (question + all options) + Co-Matching 52.6 49.2 50.1 52.6 46.8 48.5
Top 3 sentences by EER_silver + Co-Matching 58.1 51.6 53.5 55.6 48.2 50.3
Top 3 sentences by EER_DPL + Co-Matching 57.5 52.9 54.2 57.5 49.3 51.6
Top 3 sentences by InferSent (question) + FT 55.0 54.7 54.8 54.6 53.4 53.7
Top 3 sentences by InferSent (question + all options) + FT 59.2 54.6 55.9 57.2 53.8 54.8
Top 3 sentences by EER_silver + FT 62.5 57.7 59.1 64.1 55.4 58.0
Top 3 sentences by EER_DPL + FT 63.2 56.9 58.8 64.3 56.7 58.9
Top 3 sentences by EER_silver + full context + FT 63.4 58.6 60.0 63.7 57.7 59.5
Top 3 sentences by EER_DPL + full context + FT 64.2 58.5 60.2 62.4 58.7 59.8
Silver standard evidence sentences + FT 73.2 73.9 73.7 74.1 72.3 72.8
Amazon Turker Performance Lai et al. (2017) - - - 85.1 69.4 73.3
Ceiling Performance Lai et al. (2017) - - - 95.4 94.2 94.5
Table 4: Accuracy (%) of various settings on the RACE dataset. EER_silver: evidence extractor trained on the silver standard evidence sentences extracted from the ILP-based distant supervision method; EER_DPL: evidence extractor trained with DPL as a supervision module.

4.2.4 Human Evaluation

MultiRC: Extracted evidence sentences, which help neural readers to find correct answers, may still fail to convince human readers. Thus we evaluate the quality of extracted evidence sentences based on human annotations (Table 6). Even when trained using the noisy labels, we achieve a macro-average F1 score of 60.8% on MultiRC, indicating the learning and generalization capabilities of our evidence extractor, compared to 53.0%, achieved by using the noisy silver standard evidence sentences guided by correct answer options.

RACE: Since RACE does not provide the ground truth evidence sentences, to get the ground truth evidence sentences, two internal annotators annotate questions from the RACE-Middle development set. The Cohen’s kappa coefficient between two annotations is . For negation questions which include negation words (e.g., Which statement is not true according to the passage?), we have two annotation strategies: we can either find sentences that can directly imply the correct answer option; or the sentences that support the wrong answer options. During annotation, for each question, we use the strategy that leads to fewer evidence sentences.
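For reference, Cohen's kappa compares the observed agreement between two annotators with the agreement expected by chance. The sketch below assumes that agreement is measured over per-sentence binary decisions (evidence or not); the unit of annotation and the toy labels are assumptions, since the text does not state them.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long label sequences."""
    n = len(labels_a)
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    categories = set(labels_a) | set(labels_b)
    # Chance agreement from each annotator's own label distribution.
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1.0 - expected)

annotator_a = [1, 1, 0, 0, 1, 0]
annotator_b = [1, 0, 0, 0, 1, 0]
kappa = cohens_kappa(annotator_a, annotator_b)
```

The function is undefined when chance agreement equals one (for example, when both annotators always give the same single label), so callers should guard against that case.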

We find that even humans have troubles in locating evidence sentences when the relationship between a question and its correct answer option is implicitly implied. For example, a significant number of questions require understanding the entire document (e.g., “what’s the best title of this passage” and “this passage mainly tells us that _”) and/or external knowledge (e.g., “the writer begins with the four questions in order to _”, “The passage is probably from _” , and “If the writer continues the article, he would most likely write about_”). For of total questions, at least one annotator leave the slot blank due to the challenges mentioned above. The average and the maximum number of evidence sentences for the remaining questions is and respectively. The average number of evidence sentences in whole RACE dataset should be higher since questions in RACE-High are more difficult Lai et al. (2017), and we ignore of the total questions which require understanding the whole context. In MultiRC, the average/maximum number of evidence sentences is /, respectively.

Dataset Quasar-T SearchQA
Model Hits@1 Hits@3 Hits@5 Hits@1 Hits@3 Hits@5
Information Retrieval Lin et al. (2018) 6.3 10.9 15.2 13.7 24.1 32.7
INDEP Lin et al. (2018) 26.8 36.3 41.9 59.2 70.0 75.7
FULL Lin et al. (2018) 27.7 36.8 42.6 58.9 69.8 75.5
Our evidence extractor 42.3 56.7 62.0 66.2 84.9 89.9
Table 5: Evidence extraction performance on two question answering datasets Quasar-T and SearchQA. INDEP: the sentence selector is trained independently; FULL: the sentence selector is trained jointly with a neural reader.
Dataset SE vs. GT EER vs. SE EER vs. GT
RACE-M 59.9 57.1 57.5
MultiRC 53.0 63.4 60.8
- - 63.1
Table 6: Macro-average F1 with human annotations on the dev set (SE: silver standard evidence sentences; EER: evidence sentences extracted by EER trained on SE, GT: ground truth evidence sentences).

4.3 Results on Question Answering Datasets

We are aware of some similar work Choi et al. (2017); Lin et al. (2018); Htut et al. (2018) that aims to select relevant paragraphs for question answering tasks. Since most of these studies do not release implementations, we compare with Lin et al. (2018) on two open-domain question answering datasets, since their work is most similar to ours and the code is available. We report a direct comparison between our evidence extractor and this state-of-the-art sentence selector in Table 5. Our independently trained evidence extractor dramatically outperforms theirs, which is jointly trained with a neural reader, with relative improvements on both the Quasar-T and SearchQA datasets.
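The Hits@k numbers in Table 5 can be read with the following sketch in mind. It assumes that a question counts as a hit when at least one of its top-k ranked sentences contains the answer string (case-insensitive substring match); this matching rule and the toy data are assumptions, not a definition taken from the compared work.

```python
def hits_at_k(ranked_sentences_per_question, answers, k):
    """Fraction of questions with the answer string in a top-k sentence."""
    hits = 0
    for ranked, answer in zip(ranked_sentences_per_question, answers):
        if any(answer.lower() in sentence.lower() for sentence in ranked[:k]):
            hits += 1
    return hits / len(answers)

rankings = [
    ["Paris is in France", "London is large"],
    ["Berlin is cold", "Rome is old"],
]
gold_answers = ["paris", "rome"]
hits_1 = hits_at_k(rankings, gold_answers, 1)
hits_2 = hits_at_k(rankings, gold_answers, 2)
```

Hits@k is non-decreasing in k, which matches the pattern of each row in Table 5.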

5 Related Work

5.1 Sentence Selection for MRC/Fact Verification

Previous studies investigate paragraph retrieval for factoid question answering Chen et al. (2017); Wang et al. (2018c); Choi et al. (2017); Lin et al. (2018), sentence selection for machine reading comprehension Hewlett et al. (2017); Min et al. (2018), and fact verification Yin and Roth (2018); Hanselowski et al. (2018). In these tasks, most of the factual questions/claims provide sufficient clues for identifying relevant sentences, so information retrieval combined with filters can often serve as a very strong baseline. For example, in the FEVER dataset Thorne et al. (2018), only of claims require composition of multiple evidence sentences. Different from the above work, we exploit information in answer options and use various forms of indirect supervision to train our evidence extractor, and previous work can be regarded as a special case of our pipeline. Compared to Lin et al. (2018), we leverage rich linguistic knowledge for denoising.

Several studies also investigate content selection at the token level Yu et al. ; Seo et al. (2018), in which some tokens are automatically skipped by neural models. However, they do not utilize any linguistic knowledge, and a set of discontinuous tokens has limited explanatory power.

5.2 MRC with External Knowledge

Linguistic knowledge such as coreference resolution, frame semantics, and discourse relations is widely used to improve machine comprehension Wang et al. (2015); Sachan et al. (2015); Narasimhan and Barzilay (2015); Sun et al. (2018), especially when only hundreds of documents are available in a dataset such as MCTest Richardson et al. (2013). With the creation of large-scale reading comprehension datasets, recent MRC systems rely on end-to-end neural models that primarily use word embeddings as input. However, Wang et al. (2016) and Dhingra et al. (2017b, 2018) show that existing neural models do not fully take advantage of linguistic knowledge, which is still valuable for MRC. Besides widely used lexical features such as part-of-speech tags and named entity types Wang et al. (2016); Liu et al. (2017); Dhingra et al. (2017b, 2018), we consider more diverse types of external knowledge for performance improvements. Moreover, instead of using external knowledge as additional features, we accommodate it with probabilistic logic to potentially improve the interpretability of MRC models.

5.3 Explainable MRC/Question Answering

To improve the interpretability of question answering, previous work utilizes interpretable internal representations Palangi et al. (2017) or reasoning networks that dynamically employ a hop-by-hop reasoning process Zhou et al. (2018). One line of research focuses on visualizing the whole derivation process from the natural language utterance to the final answer for question answering over knowledge bases Abujabal et al. (2017) or scientific word algebra problems Ling et al. (2017). Jansen et al. (2016) extract explanations that describe the inference needed for elementary science questions (e.g., “What form of energy causes an ice cube to melt”). In comparison, the derivation sequence is less apparent for open-domain questions, especially when they require external domain knowledge or multiple-sentence reasoning. To improve explainability, one can also inspect the attention maps learned by neural readers Wang et al. (2016); however, attention maps are learned in an end-to-end fashion, which differs from our work.

Similar work by Sharp et al. (2017) also uses distant supervision to learn how to extract informative justifications. However, their experiments are primarily designed for factoid question answering, in which it is relatively easy to extract justifications since most questions are informative. In comparison, we focus on multiple-choice machine reading comprehension that requires deep understanding, and we pay particular attention to denoising strategies.

6 Conclusions

We propose an evidence extraction DNN trained with indirect supervision. To denoise the noisy labels, we combine various linguistic clues through the deep probabilistic logic (DPL) framework. We equip a state-of-the-art neural reader with the extracted evidence sentences, and it achieves comparable or better performance than the same reader with full context on three datasets. Experimental results also show that our evidence sentence extractor is superior to other state-of-the-art sentence selectors. All these results indicate the effectiveness of our evidence extractor. In future work, we aim to incorporate richer prior knowledge into DPL, jointly train the evidence extraction DNN and neural readers, and create a large-scale dataset that contains ground truth evidence sentences.


Appendix A Appendices

Besides distant supervision, DPL also includes data programming and joint inference. For data programming, we design the following sentence-level labeling functions:

A.1 Sentence-Level Labeling Functions

  • Whether a sentence contains the information asked for in the question: for “when”-questions, a sentence must contain at least one time expression; for “who”-questions, a sentence must contain at least one person entity.

  • Whether a sentence and the correct answer option have a similar length: .

  • A sentence that is neither too short nor too long since those sentences tend to be less informative or contain irrelevant information: .

  • Reward for each sentence from a neural reader. We sample different sentences and use their probabilities of leading to the correct answer option as rewards. See Section 4.1 for details about reward calculation.

  • Paraphrase embedding similarity between a question and each sentence in a document: .

  • Word embedding similarity between a question and each sentence in a document: .

  • Whether the question and the sentence contain words that have the same entity type.

  • Whether a sentence and the question have the same sentiment classification result.

  • Natural language inference result between a sentence and the question: entailment, contradiction, or neutral.

  • # of matched tokens between the concatenated question and candidate sentence with the triples in ConceptNet Speer et al. (2017): .

  • If a question requires document-level understanding, we prefer the first or the last three sentences in the reference document.
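To make the idea of a sentence-level labeling function concrete, the following sketch implements simplified versions of three of the functions listed above. It is an illustration, not the implementation used in our experiments: the thresholds `MIN_LEN`, `MAX_LEN`, and `MAX_LEN_RATIO` are hypothetical placeholders, the regular expression is a crude stand-in for a real time-expression tagger, and in DPL the outputs would inform priors over latent labels rather than act as hard votes.

```python
import re
from typing import Callable, List

# Hypothetical thresholds, chosen only for illustration.
MIN_LEN, MAX_LEN = 5, 40
MAX_LEN_RATIO = 3.0

# Crude stand-in for a time-expression tagger (an assumption, not the real component).
TIME_PATTERN = re.compile(
    r"\b(\d{4}|\d{1,2}:\d{2}|yesterday|today|tomorrow|morning|evening|night|"
    r"monday|tuesday|wednesday|thursday|friday|saturday|sunday)\b",
    re.IGNORECASE,
)


def lf_when_question(question: str, sentence: str, answer: str) -> int:
    """For "when"-questions: +1 if the sentence has a time expression, else -1; 0 abstains."""
    if not question.lower().startswith("when"):
        return 0
    return 1 if TIME_PATTERN.search(sentence) else -1


def lf_moderate_length(question: str, sentence: str, answer: str) -> int:
    """+1 if the sentence is neither too short nor too long, else -1."""
    n_tokens = len(sentence.split())
    return 1 if MIN_LEN <= n_tokens <= MAX_LEN else -1


def lf_similar_length(question: str, sentence: str, answer: str) -> int:
    """+1 if the sentence and the correct answer option have comparable lengths, else -1."""
    s_len, a_len = len(sentence.split()), len(answer.split())
    if s_len == 0 or a_len == 0:
        return -1
    return 1 if max(s_len, a_len) / min(s_len, a_len) <= MAX_LEN_RATIO else -1


LABELING_FUNCTIONS: List[Callable[[str, str, str], int]] = [
    lf_when_question,
    lf_moderate_length,
    lf_similar_length,
]


def apply_labeling_functions(question: str, sentences: List[str], answer: str) -> List[List[int]]:
    """Return one row of labeling-function outputs per sentence."""
    return [[lf(question, s, answer) for lf in LABELING_FUNCTIONS] for s in sentences]


# Toy example (hypothetical).
question = "When did Tom leave for school?"
answer = "At 7:30 in the morning"
sentences = ["Tom left home at 7:30 after a quick breakfast.", "He likes apples."]
votes = apply_labeling_functions(question, sentences, answer)
```

Each function returns +1, -1, or 0 (abstain), so functions that do not apply to a question type stay silent instead of adding noise.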

A.2 High-Order Factors

For joint inference, we consider the following high-order factors .

  • Adjacent sentences prefer the same label.

  • Evidence sentences for the same question should be within window size . For example, we assume and in Figure 1 are less likely to serve as evidence sentences for the same question.

  • Overlap ratio between evidence sentences for different questions is smaller than . We assume the same set of evidence sentences are less likely to support multiple questions.
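The three high-order factors can be read as soft constraints on a joint labeling of the sentences in a document. The sketch below scores a candidate labeling under that reading. It is only an illustration: `WINDOW_SIZE` and `MAX_OVERLAP` are hypothetical placeholder values, the Jaccard index is one possible interpretation of the overlap ratio, and the unit rewards and penalties do not reflect the factor weights learned in DPL.

```python
from typing import List, Set

# Hypothetical values, chosen only for illustration.
WINDOW_SIZE = 8
MAX_OVERLAP = 0.5


def adjacent_agreement(labels: List[int]) -> int:
    """Number of adjacent sentence pairs that share the same 0/1 label."""
    return sum(1 for a, b in zip(labels, labels[1:]) if a == b)


def within_window(evidence: Set[int], window: int = WINDOW_SIZE) -> bool:
    """True if all evidence sentences for one question fit inside the window."""
    return not evidence or max(evidence) - min(evidence) < window


def overlap_ratio(a: Set[int], b: Set[int]) -> float:
    """Jaccard overlap between the evidence sets of two questions."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def factor_score(labels_per_question: List[List[int]]) -> float:
    """Toy score: reward adjacent agreement, penalize window and overlap violations."""
    evidence_sets = [
        {i for i, label in enumerate(labels) if label == 1}
        for labels in labels_per_question
    ]
    score = float(sum(adjacent_agreement(labels) for labels in labels_per_question))
    score -= sum(1 for evidence in evidence_sets if not within_window(evidence))
    for i in range(len(evidence_sets)):
        for j in range(i + 1, len(evidence_sets)):
            if overlap_ratio(evidence_sets[i], evidence_sets[j]) >= MAX_OVERLAP:
                score -= 1
    return score


# Toy example (hypothetical): two questions over the same six-sentence document.
labels_per_question = [[1, 1, 0, 0, 0, 0], [0, 0, 0, 0, 1, 1]]
print(factor_score(labels_per_question))
```

A labeling whose evidence sentences are contiguous, close together, and distinct across questions scores higher than one that scatters evidence over the document or reuses the same sentences for every question.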