A Self-Training Method for Machine Reading Comprehension with Soft Evidence Extraction

05/11/2020 · Yilin Niu et al. · Tsinghua University

Neural models have achieved great success on machine reading comprehension (MRC), and many of them typically consist of two components: an evidence extractor and an answer predictor. The former seeks the most relevant information in a reference text, while the latter locates or generates answers from the extracted evidence. Despite the importance of evidence labels for training the evidence extractor, they are not cheaply accessible, particularly in many non-extractive MRC tasks such as YES/NO question answering and multiple-choice MRC. To address this problem, we present a Self-Training method (STM), which supervises the evidence extractor with auto-generated evidence labels in an iterative process. At each iteration, a base MRC model is trained with golden answers and noisy evidence labels. The trained model then predicts pseudo evidence labels as extra supervision for the next iteration. We evaluate STM on seven datasets over three MRC tasks. Experimental results demonstrate consistent improvements over existing MRC models, and we also analyze how and why such a self-training method works in MRC.







1 Introduction

Machine reading comprehension (MRC) has received increasing attention recently. MRC tasks can be roughly divided into two categories: extractive and non-extractive. Extractive MRC requires a model to extract an answer span to a question from reference documents, as in SQuAD (Rajpurkar et al., 2016) and CoQA (Reddy et al., 2019). In contrast, non-extractive MRC infers answers based on some evidence in reference documents, including Yes/No question answering (Clark et al., 2019), multiple-choice MRC (Lai et al., 2017; Khashabi et al., 2018; Sun et al., 2019), and open-domain question answering (Dhingra et al., 2017b). As shown in Table 1, evidence plays a vital role in MRC (Zhou et al., 2019; Ding et al., 2019; Min et al., 2018), and the coarse-to-fine paradigm has been widely adopted (Choi et al., 2017; Li et al., 2018; Wang et al., 2018), where an evidence extractor first seeks evidence in the given documents and an answer predictor then infers the answer based on that evidence. However, it is challenging to learn a good evidence extractor, since evidence labels for supervision are usually lacking.

Manually annotating golden evidence is expensive. Therefore, some recent efforts have been dedicated to improving MRC by leveraging noisy evidence labels when training the evidence extractor. Some works (Lin et al., 2018; Min et al., 2018) generate distant labels using hand-crafted rules and external resources. Other studies (Wang et al., 2018; Choi et al., 2017) adopt reinforcement learning (RL) to decide the evidence labels; however, such RL methods suffer from unstable training. Further distant supervision techniques have been used to refine noisy labels, such as deep probabilistic logic (Wang et al., 2019), but they are hard to transfer to other tasks. In short, improving the evidence extractor remains challenging when golden evidence labels are not available.

Q: Did a little boy write the note?
D: This note is from a little girl. She wants to be your friend. If you want to be her friend, …
A: No
Q: Is she carrying something?
D: …On the step, I find the elderly Chinese lady, small and slight, holding the hand of a little boy. In her other hand, she holds a paper carrier bag.
A: Yes
Table 1: Examples of Yes/No question answering. Evidential sentences in bold in reference documents are crucial to answer the questions.

In this paper, we present a general and effective method based on Self-Training (Scudder, 1965) to improve MRC with soft evidence extraction when golden evidence labels are not available. Following the Self-Training paradigm, a base MRC model is trained iteratively. At each iteration, the base model is trained with golden answers as well as the noisy evidence labels obtained at the preceding iteration. The trained model then generates noisy evidence labels, which are used to supervise evidence extraction at the next iteration. An overview of our method is shown in Figure 1. Through this iterative process, evidence is labelled automatically to guide the MRC model in finding answers, and a better MRC model in turn benefits the evidence labelling process. Our method requires no manual effort or external information, and can therefore be applied to any MRC task. Besides, the Self-Training algorithm converges more stably than RL. The two main contributions of this paper are summarized as follows:

  1. We propose a self-training method to improve machine reading comprehension via soft evidence labeling. Compared with existing methods, our method is more effective and general.

  2. We verify the generalization and effectiveness of STM on several MRC tasks, including Yes/No question answering (YNQA), multiple-choice machine reading comprehension (MMRC), and open-domain question answering (ODQA). Our method is applicable to different base models, including BERT and DSQA (Lin et al., 2018). Experimental results demonstrate that our method remarkably improves the base models on all three MRC tasks.

2 Related Work

Early MRC studies focus on modeling semantic matching between a question and a reference document (Seo et al., 2017; Huang et al., 2018; Zhu et al., 2018; Mihaylov and Frank, 2018). In order to mimic the human reading process, hierarchical coarse-to-fine methods have been proposed (Choi et al., 2017; Li et al., 2018). Such models first read the full text to select relevant text spans, and then infer answers from these spans. Extracting such spans in MRC is drawing more and more attention, though it remains quite challenging (Wang et al., 2019).

Evidence extraction aims at finding evidential and relevant information for downstream processes in a task, which arguably improves the overall performance of the task. Not surprisingly, evidence extraction is useful and becomes an important component in fact verification (Zhou et al., 2019; Yin and Roth, 2018; Hanselowski et al., 2018; Ma et al., 2019), multiple-choice reading comprehension (Wang et al., 2019; Bax, 2013; Yu et al., 2019), open-domain question answering (Lin et al., 2018; Wang et al., 2018), multi-hop reading comprehension (Nishida et al., 2019; Ding et al., 2019), natural language inference (Wang et al., 2017; Chen et al., 2017), and a wide range of other tasks (Nguyen and Nguyen, 2018; Chen and Bansal, 2018).

In general, evidence extraction in MRC can be classified into four types according to the training method. First, unsupervised methods provide no guidance for evidence extraction (Seo et al., 2017; Huang et al., 2019). Second, supervised methods train evidence extraction with golden evidence labels, which sometimes can be generated automatically in extractive MRC settings (Lin et al., 2018; Yin and Roth, 2018; Hanselowski et al., 2018). Third, weakly supervised methods rely on noisy evidence labels, which can be obtained by heuristic rules (Min et al., 2018); moreover, some data programming techniques, such as deep probabilistic logic, have been proposed to refine noisy labels (Wang et al., 2019). Last, if a weak extractor is obtained via unsupervised or weakly supervised pre-training, reinforcement learning can be utilized to learn a better evidence extraction policy (Wang et al., 2018; Choi et al., 2017).

For non-extractive MRC tasks, such as YNQA and MMRC, it is cumbersome and inefficient to annotate evidence labels (Ma et al., 2019). Although various methods for evidence extraction have been proposed, training an effective extractor is still a challenging problem when golden evidence labels are unavailable. Weakly supervised methods either suffer from low performance or rely on too many external resources, which makes them difficult to transfer to other tasks. RL methods can indeed train a better extractor without evidence labels. However, they are much more complicated and unstable to train, and highly dependent on model pretraining.

Our method is based on Self-Training, a widely used semi-supervised method. Most related studies follow the framework of traditional Self-Training (Scudder, 1965) and Co-Training (Blum and Mitchell, 1998), and focus on designing better policies for selecting confident samples. CoTrade (Zhang and Zhou, 2011) evaluates the confidence of whether a sample has been correctly labeled via a statistic-based data editing technique (Zighed et al., 2002). Self-paced Co-Training (Ma et al., 2017) adjusts labeled data dynamically according to the consistency between the two models trained on different views. A reinforcement learning based method (Wu et al., 2018) designs an additional Q-agent as a sample selector.

3 Methods

3.1 Task Definition and Model Overview

The task of machine reading comprehension can be formalized as follows: given a reference document $D = \{s_1, s_2, \dots, s_n\}$ composed of $n$ sentences and a question $q$, the model should extract or generate an answer $a$ to this question conditioned on the document, formally as

$$a = \arg\max_{a'} P(a' \mid D, q).$$

The process can be decomposed into two components, i.e., an evidence extractor and an answer predictor. The golden answer $a^*$ is given for training the entire model, including the evidence extractor and the answer predictor. Denote $e_i \in \{0, 1\}$ as a binary evidence label for the $i$-th sentence $s_i$, where $e_i = 0/1$ corresponds to a non-evidence/evidence sentence, respectively. An auxiliary loss on the evidence labels can help the training of the evidence extractor.

The overview of our method is shown in Figure 1, which is an iterative process. During training, two data pools are maintained, denoted as $\mathcal{D}_U$ (unlabeled data) and $\mathcal{D}_L$ (labeled data). In addition to golden answers, examples in $\mathcal{D}_L$ are annotated with pseudo evidence labels; in contrast, only golden answers are provided in $\mathcal{D}_U$. At each iteration, the base model is trained on both data pools (the two training arrows). After training, the model makes evidence predictions on the unlabeled instances (the labeling arrow) and chooses the most confident instances from $\mathcal{D}_U$ to provide noisy evidence labels. The instances with newly generated evidence labels are moved from $\mathcal{D}_U$ to $\mathcal{D}_L$ (the moving arrow) and are used to supervise evidence extraction at the next iteration. This process iterates several times.

Figure 1: Overview of Self-Training MRC (STM). The base model is trained on both $\mathcal{D}_L$ and $\mathcal{D}_U$. After training, the base model generates evidence labels for the data from $\mathcal{D}_U$, and the most confident samples are chosen to supervise the evidence extractor at the next iteration. The selected data is moved from $\mathcal{D}_U$ to $\mathcal{D}_L$ at each iteration.
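The iterative loop just described can be sketched in a few lines of Python. This is a toy illustration rather than the authors' implementation: `predict_evidence` stands in for the trained base model (here it merely scores sentences by word overlap with the question), and the actual training step is elided.

```python
# Minimal sketch of one STM iteration on toy data (not the authors' code).

def predict_evidence(x):
    """Toy extractor: pick the sentence with the most word overlap with the question."""
    q = set(x["question"].lower().split())
    scores = [len(q & set(s.lower().split())) for s in x["sentences"]]
    best = max(range(len(scores)), key=scores.__getitem__)
    conf = scores[best] / (sum(scores) + 1e-9)  # crude confidence in [0, 1]
    return best, conf

def stm_iteration(d_u, d_l, tau=0.5):
    """Label D_U with the (re)trained model; move confident samples to D_L."""
    # ... training the base model on d_l + d_u would happen here ...
    still_unlabeled, newly_labeled = [], []
    for x in d_u:
        idx, conf = predict_evidence(x)
        if conf >= tau:  # confidence filter (thresholds in the paper)
            newly_labeled.append({**x, "evidence": idx})
        else:
            still_unlabeled.append(x)
    return still_unlabeled, d_l + newly_labeled

d_u = [{"question": "is she carrying something",
        "sentences": ["she holds a paper carrier bag", "the boy runs away"],
        "evidence": None}]
d_l = []
d_u, d_l = stm_iteration(d_u, d_l)
```

In the full method, the confidence filter corresponds to the thresholds and the Top-N selection of Algorithm 1.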

3.2 Base Model

As shown in Figure 2, the overall structure of a base model consists of an encoder layer, an evidence extractor, and an answer predictor.

Figure 2: Overall structure of a base model, consisting of an encoder layer, an evidence extractor, and an answer predictor. The encoders obtain a representation $\tilde{q}$ for the question and $\tilde{s}_i$ for each sentence in a document. The summary vector $d$ is used to predict the answer.

The encoder layer takes the document and question as input to obtain a contextual representation for each word. Denote $h_{i,j}$ as the representation of the $j$-th word in sentence $s_i$, and $u_k$ as the representation of the $k$-th word in question $q$. Our framework is agnostic to the architecture of the encoder, and we show improvements on two widely used encoding models, i.e., Transformer (with BERT; Devlin et al., 2019) and LSTM (with DSQA; Lin et al., 2018), in the experiments.

The evidence extractor employs hierarchical attention, including token- and sentence-level attention, to obtain the document representation $d$.

Token-level attention obtains a sentence vector by self-attention (Vaswani et al., 2017) over the words in a sentence, as follows:

$$\tilde{s}_i = \sum_j \alpha_{i,j} h_{i,j}, \qquad \alpha_{i,j} = \operatorname{softmax}_j\big(f(h_{i,j}, w)\big),$$

and analogously $\tilde{q} = \sum_k \alpha^{q}_{k} u_k$, where $\tilde{q}$ is the sentence-level representation of the question. $\alpha_{i,j}$ refers to the importance of word $j$ in sentence $s_i$, and so on for $\alpha^{q}_{k}$. $w$ and the parameters of $f$ are learnable. The attention function $f$ follows the bilinear form (Kim et al., 2018).

Sentence-level attention identifies important sentences conditioned on the question in a soft way to obtain the summary vector $d$, as follows:

$$d = \sum_i \beta_i \tilde{s}_i, \qquad \beta_i = \operatorname{softmax}_i\big(g(\tilde{s}_i, \tilde{q})\big),$$

where $g$ has the same bilinear form as $f$ with different parameters, and $\beta_i$ refers to the importance of the corresponding sentence.
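The two attention levels can be sketched with NumPy as follows; the hidden size, the random inputs, and the exact parameterization of the attention functions are illustrative assumptions rather than the paper's exact configuration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d = 8                                       # hidden size (illustrative)
H = [rng.normal(size=(5, d)),               # word representations of sentence 1
     rng.normal(size=(7, d))]               # word representations of sentence 2
Hq = rng.normal(size=(4, d))                # word representations of the question
w = rng.normal(size=d)                      # learnable query for token-level attention
W = rng.normal(size=(d, d))                 # bilinear weights for sentence-level attention

# Token-level attention: one vector per sentence, plus a question vector.
S = np.stack([softmax(h @ w) @ h for h in H])   # (n_sentences, d)
q = softmax(Hq @ w) @ Hq                        # (d,)

# Sentence-level attention: question-conditioned summary vector.
beta = softmax(S @ W @ q)                       # importance of each sentence
summary = beta @ S                              # document summary vector d
```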

The answer predictor adopts different structures for different MRC tasks. For Yes/No question answering, we use a simple linear classifier over $d$ to infer answers. For multiple-choice MRC, we use a multi-layer perceptron (MLP) with softmax to obtain the score of each choice. For open-domain question answering, one MLP predicts the answer start position and another MLP predicts the end position.
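The three task-specific heads are small layers on top of the summary and token representations; a minimal sketch (all shapes and weights here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
summary = rng.normal(size=d)                    # summary vector from the extractor

# Yes/No head: a linear classifier over the summary vector.
W_yn = rng.normal(size=(2, d))
yes_no_logits = W_yn @ summary                  # (2,) -> argmax gives Yes/No

# Multiple-choice head: score each answer-option representation against the summary.
choices = rng.normal(size=(4, d))               # one vector per option
choice_scores = choices @ summary               # (4,) -> softmax over options

# Span head (open-domain QA): two projections score each token as start/end.
tokens = rng.normal(size=(20, d))
w_start, w_end = rng.normal(size=d), rng.normal(size=d)
start_logits, end_logits = tokens @ w_start, tokens @ w_end   # (20,), (20,)
```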

3.3 Loss Function

We adopt two loss functions: a task-specific loss and an evidence loss.

The task-specific loss is defined as the negative log-likelihood (NLL) of predicting the golden answer, formally as follows:

$$\mathcal{L}_{task} = -\log P(\hat{a} = a^* \mid D, q),$$

where $\hat{a}$ denotes the predicted answer and $a^*$ is the golden answer.

When evidence labels are provided, we can impose supervision on the evidence extractor. For the most general case, we assume that a variable number of evidence sentences exists in each sample. Inspired by previous work that used multiple evidence sentences (Nishida et al., 2019), we calculate the evidence loss step by step. Suppose we extract $K$ evidence sentences. At the first step, we compute the loss of selecting the most plausible evidence sentence. At the second step, we compute the loss over the remaining sentences, where the previously selected sentence is masked and not counted. The overall loss is the average of the step-by-step losses until $K$ evidence sentences have been selected. In this manner, we devise a BP-able (back-propagatable) surrogate loss function for choosing the top-$K$ evidence sentences.

Formally, we have

$$\mathcal{L}_{evi} = \frac{1}{K} \sum_{t=1}^{K} \mathcal{L}^{t},$$

where $K$ is the number of evidence sentences, a pre-specified hyperparameter, and each $m_i^t \in \{0, 1\}$ is a sentence mask: $m_i^t = 1$ means sentence $s_i$ has not been selected before step $t$, and $m_i^t = 0$ means it has been selected.

At each step, the model computes an attention distribution over the unselected sentences, as follows:

$$\beta_i^t = \frac{m_i^t \exp\big(g(\tilde{s}_i, \tilde{q})\big)}{\sum_{i'} m_{i'}^t \exp\big(g(\tilde{s}_{i'}, \tilde{q})\big)}.$$
The attention weights on the previously selected sentences are zero; in other words, they are masked out. Then, the step-wise loss can be computed as follows:

$$\mathcal{L}^{t} = -\log \sum_i \beta_i^t e_i,$$

where $\beta_i^t$ indicates the attention weight for sentence $s_i$ at step $t$, and $e_i$ is the evidence label for sentence $s_i$. The sentence with the largest attention weight is chosen as the $t$-th evidence sentence.

For each sentence $s_i$, $m_i^1$ is initialized to $1$. At each step $t$, the mask is set to $0$ if sentence $s_i$ was chosen as an evidence sentence at the preceding step, and remains unchanged otherwise. Formally, the mask is updated as follows:

$$m_i^{t+1} = \begin{cases} 0, & i = \arg\max_{i'} \beta_{i'}^{t}, \\ m_i^{t}, & \text{otherwise.} \end{cases}$$
During training, the total loss is the combination of the task-specific loss and the evidence loss:

$$\mathcal{L} = \sum_{x \in \mathcal{D}_L \cup \mathcal{D}_U} \mathcal{L}_{task}(x) + \lambda \sum_{x \in \mathcal{D}_L} \mathcal{L}_{evi}(x),$$

where $\lambda$ is a factor that balances the two loss terms, and $\mathcal{D}_L$ and $\mathcal{D}_U$ denote the sets of instances with and without evidence labels, respectively. Note that the evidence labels in $\mathcal{D}_L$ are obtained automatically by our self-training method.
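The step-wise evidence loss and its combination with the task loss can be sketched as follows. This is a toy NumPy illustration, not the authors' implementation; the logit and label values are made up, and the task loss is a placeholder scalar.

```python
import numpy as np

def masked_attention(logits, mask):
    """Attention over the unselected sentences only (mask: 1 = still available)."""
    e = np.exp(logits - logits.max()) * mask
    return e / e.sum()

def evidence_loss(logits, labels, k):
    """Step-wise surrogate loss for selecting the top-k evidence sentences."""
    mask = np.ones_like(logits)            # m_i = 1: nothing selected yet
    step_losses = []
    for _ in range(k):
        beta = masked_attention(logits, mask)
        # negative log of the attention mass placed on labelled evidence
        step_losses.append(-np.log((beta * labels).sum() + 1e-12))
        mask[int(beta.argmax())] = 0       # mask the chosen sentence for later steps
    return float(np.mean(step_losses))

sent_logits = np.array([2.0, 0.5, 1.5, -1.0])   # toy g(s_i, q) scores
evi_labels = np.array([1.0, 0.0, 1.0, 0.0])     # sentences 0 and 2 are evidence

lam = 0.8                                        # lambda, weight of the evidence loss
task_loss = 0.3                                  # placeholder task-specific NLL
total_loss = task_loss + lam * evidence_loss(sent_logits, evi_labels, k=2)
```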

3.4 Self-Training MRC (STM)

STM is designed to improve base MRC models via generating pseudo evidence labels for evidence extraction when golden labels are unavailable. STM works in an iterative manner, and each iteration consists of two stages. One is to learn a better base model for answer prediction and evidence labelling. The other is to obtain more precise evidence labels for the next iteration using the updated model.

At each iteration, STM first trains the base model with golden answers and the pseudo evidence labels from the preceding iteration, using the total loss defined in Section 3.3. Then the trained model predicts a distribution over pseudo evidence labels for each unlabelled instance $x \in \mathcal{D}_U$, and decides the pseudo labels $\hat{e}$ from this distribution (Equation 2).

The confidence of a labelled instance is measured from the same predicted distribution. STM selects the instances with the largest confidence scores, using the two pre-specified thresholds ($\tau_1$ and $\tau_2$ in Algorithm 1) to filter out unreliable predictions. The selected instances are moved from $\mathcal{D}_U$ to $\mathcal{D}_L$ for the next iteration.

In the first iteration (iteration 0), the initial labeled set $\mathcal{D}_L$ is empty, so the base model is supervised only by golden answers. In this case, the evidence extractor is trained in a distantly supervised manner.

The procedure of one iteration of STM is illustrated in Algorithm 1. $\tau_1$ and $\tau_2$ are two threshold hyper-parameters. The $\operatorname{TopN}$ operation ranks the candidate samples according to their confidence scores and returns the top $N$ samples. $N$ varies across datasets, and details are presented in the appendix.

Input: Training sets $\mathcal{D}_L$, $\mathcal{D}_U$; thresholds $\tau_1$ and $\tau_2$; number of generated labels $N$; weight of evidence loss $\lambda$
Output: Trained MRC model $M$; updated training sets $\mathcal{D}_L$, $\mathcal{D}_U$
1: Randomly initialize $M$;
2: Train $M$ on $\mathcal{D}_L$ and $\mathcal{D}_U$;
3: Initialize the candidate set $C \leftarrow \emptyset$;
4: for each $x \in \mathcal{D}_U$ do
5:     Predict the evidence distribution of $x$ with $M$;
6:     Generate pseudo labels $\hat{e}$ via Equation 2;
7:     Compute the confidence of $x$;
8:     if the confidence passes the thresholds $\tau_1$ and $\tau_2$ then
9:         Label $x$ with $\hat{e}$;
10:        Add $x$ to $C$;
11:    end if
12: end for
13: $C \leftarrow \operatorname{TopN}(C)$;
14: $\mathcal{D}_L \leftarrow \mathcal{D}_L \cup C$; $\mathcal{D}_U \leftarrow \mathcal{D}_U \setminus C$;
15: return $M$, $\mathcal{D}_L$, $\mathcal{D}_U$;
Algorithm 1 One iteration of STM

3.5 Analysis

To understand why STM can improve evidence extraction and the performance of MRC, we revisit the training process and present a theoretical explanation, as inspired by (Anonymous, 2020).

In Section 3.4, we introduced the simple labeling strategy used in STM. If there were no sample selection, the evidence loss could be formulated as

$$\mathcal{L}_{evi}(\theta_{t+1}) = -\,\mathbb{E}_{x \sim \mathcal{D}_U}\; \mathbb{E}_{\hat{e} \sim P(\hat{e} \mid x;\, \theta_t)} \log P(\hat{e} \mid x;\, \theta_{t+1}),$$

where $P(\hat{e} \mid x; \theta_t)$ represents the model's evidence distribution, and $\theta_t$ is the parameter of the $t$-th iteration. In this case, pseudo evidence labels are randomly sampled from $P(\hat{e} \mid x; \theta_t)$ to guide $\theta_{t+1}$, and therefore minimizing $\mathcal{L}_{evi}$ will lead to $\theta_{t+1} \to \theta_t$. As a matter of fact, the sample selection strategy in STM filters out the low-quality pseudo labels with two distribution mappings, $g_1$ and $g_2$. The optimization target becomes

$$\mathcal{L}_{evi}(\theta_{t+1}) = -\,\mathbb{E}_{x \sim g_1(\mathcal{D}_U)}\; \mathbb{E}_{\hat{e} = g_2(P(\hat{e} \mid x;\, \theta_t))} \log P(\hat{e} \mid x;\, \theta_{t+1}).$$

In STM, $g_1$ is a filter function with the two pre-specified thresholds $\tau_1$ and $\tau_2$, and $g_2$ is defined by Equation 2. Compared with random sampling, this strategy tends to prevent $\theta_{t+1}$ from learning wrong knowledge from $\theta_t$, and the subsequent training might benefit from implicitly learning the strategy. In general, the strategy of STM imposes simple prior knowledge on the base models via the two distribution mappings, which may partly explain the performance gains.

4 Experiments

Model / Dataset   CoQA   MARCO   BoolQ
BERT-MLP          78.0   70.8    71.6
BERT-HA           78.8   71.3    72.9
BERT-HA+RL        79.3   70.3    70.4
BERT-HA+Rule      78.1   70.4    73.8
BERT-HA+STM       80.5   72.3    75.2
BERT-HA+Gold      82.0   N/A     N/A
Table 2: Classification accuracy on three Yes/No question answering datasets. N/A means there is no golden evidence label. Significance tests (t-test) were conducted between BERT-HA+STM and the best baseline of each column.
Model / Dataset   RACE-M       RACE-H       MultiRC            DREAM
                  Dev   Test   Dev   Test   Dev                Dev   Test
                  Acc   Acc    Acc   Acc    F1a   F1m   EM     Acc   Acc
GPT+DPL           64.2  62.4   58.5  60.2   70.5  67.8  13.3   57.3  57.7
BERT-MLP          66.2  65.5   61.6  59.5   71.8  69.1  21.2   63.9  63.2
BERT-HA           67.8  68.2   62.6  60.4   70.1  68.1  19.9   64.2  62.8
BERT-HA+RL        68.5  66.9   62.5  60.0   72.1  69.5  21.1   63.1  63.4
BERT-HA+Rule      66.6  66.4   61.6  59.0   69.5  66.7  17.9   62.5  63.0
BERT-HA+STM       69.3  69.2   64.7  62.6   74.0  70.9  22.0   65.3  65.8
BERT-HA+Gold      N/A   N/A    N/A   N/A    73.7  70.9  27.2   N/A   N/A
Table 3: Results on three multiple-choice reading comprehension datasets. (F1a: F1 score on all answer-options; F1m: macro-average F1 score over all questions; EM: exact match.) Note that there are no golden evidence labels on RACE and DREAM. The results for DPL (deep probabilistic logic) are copied from Wang et al. (2019). Significance tests (t-test) were conducted between BERT-HA+STM and the best baseline of each column.

4.1 Datasets

4.1.1 Yes/No Question Answering (YNQA)

CoQA (Reddy et al., 2019) is a multi-turn conversational question answering dataset where questions may be incomplete and need historical context to get the answers. We extracted the Yes/No questions from CoQA, along with their histories, to form a YNQA dataset.

BoolQ (Clark et al., 2019) consists of Yes/No questions from the Google search engine. Each question is accompanied by a related paragraph. We expanded each short paragraph by concatenating some randomly sampled sentences.

MS MARCO (Nguyen et al., 2016) is a large MRC dataset. Each question is paired with a set of reference documents, and the answer may not exist in the documents. We extracted all Yes/No questions, and randomly picked some reference documents containing evidence (the evidence annotations in documents are provided by the original dataset). To balance the ratio of Yes and No questions, we randomly removed some questions whose answers are Yes.

4.1.2 Multiple-choice MRC

RACE (Lai et al., 2017) consists of about 28,000 passages and 100,000 questions from English exams for middle (RACE-M) and high (RACE-H) schools of China. The average number of sentences per passage in RACE-M and RACE-H is about 16 and 17, respectively.

DREAM (Sun et al., 2019) contains 10,197 multiple-choice questions with 6,444 dialogues, collected from English examinations. In DREAM, 85% of the questions require reasoning with multiple evidential sentences.

MultiRC (Khashabi et al., 2018) is an MMRC dataset where the amount of correct options to each question varies from 1 to 10. Each question in MultiRC is annotated with evidence from its reference document. The average number of annotated evidence sentences for each question is 2.3.

4.1.3 Open-domain QA (ODQA)

Quasar-T (Dhingra et al., 2017b) consists of 43,000 open-domain trivia questions, whose answers were extracted from ClueWeb09. For a fair comparison, we retrieved 50 reference sentences from ClueWeb09 for each question, the same as DSQA (Lin et al., 2018).

4.2 Baselines

We compared several methods in our experiments, including some powerful base models without evidence supervision and some existing methods (*+Rule/RL/DPL/STM) which improve MRC with noisy evidence labels. Experimental details are shown in the appendix.

YNQA and MMRC: (1) BERT-MLP utilizes a BERT encoder and an MLP answer predictor. The predictor performs classification based on the BERT representation at the position of [CLS]. The parameters of the BERT module were initialized from BERT-base. (2) BERT-HA refers to the base model introduced in Section 3.2, which applies hierarchical attention over words and sentences. (3) Based on BERT-HA, BERT-HA+Rule supervises the evidence extractor with noisy evidence labels derived from hand-crafted rules. We explored three types of rules based on Jaccard similarity, integer linear programming (ILP) (Boudin et al., 2015), and inverse term frequency (ITF) (Wang et al., 2019), among which ITF performed best in most cases. For simplicity, we only report experimental results with the ITF rule. (4) Based on BERT-HA, BERT-HA+RL trains the evidence extractor via reinforcement learning, similar to Choi et al. (2017). (5) GPT+DPL (Wang et al., 2019) uses deep probabilistic logic (DPL); it is complicated and its source code is not available, so we directly used the results from the original paper and did not evaluate it with BERT.

ODQA: (1) For each question, DSQA (Lin et al., 2018) aggregates multiple relevant paragraphs from ClueWeb09, and then infers an answer from these paragraphs. (2) GA (Dhingra et al., 2017a) and BiDAF (Seo et al., 2017) perform semantic matching between questions and paragraphs with attention mechanisms. (3) R^3 (Wang et al., 2018) is a reinforcement learning method that explicitly selects the paragraph most relevant to a given question for the subsequent reading comprehension module.

Model                        EM    F1
GA (Dhingra et al., 2017a)   26.4  26.4
BiDAF (Seo et al., 2017)     25.9  28.5
R^3 (Wang et al., 2018)      35.3  41.7
DSQA (Lin et al., 2018)      40.7  47.6
 +distant supervision        41.7  48.7
 +STM                        41.8  49.2
Table 4: Experimental results on the test set of Quasar-T. R^3 is an RL-based method. Results of GA, BiDAF and R^3 are copied from Lin et al. (2018). DSQA+STM outperforms the best baseline (DSQA+DS, DS = distant supervision) significantly (t-test).

4.3 Main Results

4.3.1 Yes/No Question Answering

Table 2 shows the results on the three YNQA datasets. We only report classification accuracy on the development sets, since the test sets are unavailable.

BERT-HA+STM outperformed all the baselines, which demonstrates the effectiveness of our method. Compared with BERT-MLP, BERT-HA achieved better performance on all three datasets, indicating that distant supervision of evidence extraction can benefit Yes/No question answering. However, compared with BERT-HA, BERT-HA+RL made no improvement on MARCO and BoolQ, possibly due to the high variance of RL training. Similarly, BERT-HA+Rule performed worse than BERT-HA on CoQA and MARCO, implying that it is difficult for the rule-based method (inverse term frequency) to find correct evidence in these two datasets. In contrast, our method BERT-HA+STM is more general and performed the best on all datasets. It achieved performance comparable to BERT-HA+Gold, the upper bound obtained by providing golden evidence labels, indicating the effectiveness of the noisy labels generated by our method.

4.3.2 Multiple-choice MRC

Table 3 shows the experimental results on the three MMRC datasets. We adopt the metrics from the original papers. STM improved BERT-HA consistently on RACE-H, MultiRC, and DREAM in terms of all metrics. However, the improvement on RACE-M is limited (a 1.0-point gain on the test set). The reason may be that RACE-M is much simpler than RACE-H, and thus finding the correct evidence on RACE-M is not challenging for the evidence extractor of BERT-HA.

4.3.3 Open-domain Question Answering

Table 4 shows the exact match and F1 scores on Quasar-T. Distant evidence supervision (DS) indicates whether a passage contains the answer text. Compared with the base models DSQA and DSQA+DS, DSQA+STM achieved better performance on both metrics, which verifies that DSQA can also benefit from self-training. Our method is general and can improve both lightweight and heavyweight models, e.g., LSTM-based and BERT-based models, on different tasks.

Model / Dataset     CoQA   MultiRC
                    P@1    R@1   R@2   R@3   P@1   P@2   P@3
BERT-HA 20.0 28.2 49.8 62.5 62.3 55.2 46.6
 +RL 5.2 10.5 22.3 32.9 24.0 25.3 24.7
 +Rule 38.4 32.4 53.6 65.1 71.8 59.6 48.7
 +STM (iter 1) 32.7 32.8 57.1 70.1 72.2 63.3 52.5
 +STM (iter 2) 37.3 32.9 58.0 71.3 72.7 64.4 53.5
 +STM (iter 3) 39.9 31.4 55.3 68.8 69.5 61.6 51.6
BERT-HA+Gold 53.6 33.7 59.5 73.4 74.5 65.9 54.8
Table 5: Evidence extraction evaluation on the development sets of CoQA and MultiRC. P@k / R@k represent the precision / recall of the generated evidence labels, respectively, when the top k sentences are predicted as evidence.

4.4 Performance of Evidence Extraction

To evaluate the performance of STM on evidence extraction, we validated the evidence labels generated by several methods on the development sets of CoQA and MultiRC. Considering that the evidence of each question in MultiRC is a set of sentences, we adopted P@k and R@k as the metrics for MultiRC, which represent the precision and recall of the generated evidence labels, respectively, when the top k sentences are predicted as evidence. We adopted only P@1 as the metric for CoQA, as this dataset provides each question with a single golden evidence sentence.
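These metrics are straightforward to compute from a ranked list of sentence indices; a small sketch (not the authors' evaluation code):

```python
def precision_recall_at_k(ranked, gold, k):
    """P@k and R@k for sentence indices ranked by predicted evidence score."""
    hits = len(set(ranked[:k]) & set(gold))
    return hits / k, hits / len(gold)

# Toy example: gold evidence = sentences {1, 4}; model ranking = [4, 0, 1, 2, 3]
p_at_2, r_at_2 = precision_recall_at_k([4, 0, 1, 2, 3], [1, 4], k=2)  # (0.5, 0.5)
```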

Table 5 shows the performance of five methods for evidence labeling on the CoQA and MultiRC development sets. It can be seen that BERT-HA+STM outperformed the base model BERT-HA by a large margin in terms of all the metrics. As a result, the evidence extractor augmented with STM provided more evidential information for the answer predictor, which may explain the improvements of BERT-HA+STM on the two datasets.

4.5 Analysis on Error Propagation

To examine whether error propagation exists in STM and how severe it is, we visualized the evolution of evidence predictions on the development set of CoQA (Figure 3). From the inside to the outside, the four rings show the statistics of the evidence predicted by BERT-HA (iteration 0) and BERT-HA+STM (iterations 1, 2, 3). Each ring is composed of all the instances from the development set of CoQA, and each radius corresponds to one sample. If the evidence of an instance is predicted correctly, the corresponding radius is marked in green; otherwise, in purple. Two examples are shown in the appendix due to the space limit.

Self-correction. As the innermost ring shows, about 80% of the evidence predicted by BERT-HA (iteration 0) was incorrect. However, the proportion of wrong instances was reduced to about 60% after self-training (iteration 3). More concretely, many of the wrong predictions were gradually corrected with high confidence within three self-training iterations, as exemplified by instance A in Figure 3.

Error propagation. We also observed that some of the evidence was mistakenly revised by STM, as exemplified by instance B in Figure 3. In such cases, the incorrect predictions are likely to be retained in the next iteration. However, most of these mistakes were eventually corrected during subsequent iterations, as for instance C. This observation shows that STM can keep error propagation from causing catastrophic failure.

Figure 3: Evolution of evidence predictions on the development set of CoQA. From the inside to the outside, the four rings correspond to BERT-HA (iteration 0) and BERT-HA+STM (iteration 1, 2, 3), respectively.

4.6 Improvement Over Stronger Pretrained Models

To evaluate the improvement of STM over stronger pretrained models, we employed RoBERTa-large (Liu et al., 2019) as the encoder of the base model. Table 6 shows the results on CoQA. STM significantly improved the evidence extraction accuracy (Evi. Acc) of the base model. However, the improvement on answer prediction (Ans. Acc) is marginal. One reason is that RoBERTa-HA already achieves such high performance that there is limited room for improvement. Another possible explanation is that evidence information is less important for such stronger models when generating answers; in other words, they may be more adept at exploiting data bias for answer prediction. In comparison, weaker pretrained models, such as BERT-base, benefit more from evidence information due to their weaker ability to exploit data bias.

Model/Metric Ans. Acc Evi. Acc
RoBERTa-HA 92.6 13.8
RoBERTa-HA+STM 92.7 19.3 (+40%)
Table 6: Answer prediction accuracy (Ans. Acc) and evidence extraction accuracy (Evi. Acc) on the development set of CoQA.

5 Conclusion and Future Work

We present an iterative self-training method (STM) to improve MRC models with soft evidence extraction when golden evidence labels are unavailable. In this iterative method, we train the base model with golden answers and pseudo evidence labels. The updated model then generates new pseudo evidence labels, which serve as additional supervision in the next iteration. Experimental results show that our method consistently improves the base models on seven datasets across three MRC tasks, and that better evidence extraction indeed enhances the final performance of MRC.

As future work, we plan to extend our method to other NLP tasks which rely on evidence finding, such as natural language inference.


This work was jointly supported by the NSFC projects (Key project with No. 61936010 and regular project with No. 61876096), and the National Key R&D Program of China (Grant No. 2018YFC0830200). We thank THUNUS NExT Joint-Lab for the support.


Appendix A Case Study

In Section 4.5 of the main paper, we provide a quantitative analysis of the evolution of evidence predictions and draw two conclusions: (1) STM can help the base model correct itself; (2) error propagation, though it exists, does not result in catastrophic failure.

To help understand these two conclusions, we provide two corresponding cases from the development set of CoQA (Reddy et al., 2019). The original instances are shown in Table 7, and the weight distributions from the sentence-level attention are shown in Figure 4. In case 1, BERT-HA made a wrong evidence prediction, which STM subsequently revised, showing the ability of self-correction. In case 2, BERT-HA first selected the correct evidence with high confidence. However, at iteration 1, BERT-HA with STM was distracted by another plausible sentence. Instead of insisting on the incorrect prediction, STM led BERT-HA back to the correct evidence, which shows that error propagation is not catastrophic.

Figure 4: Weight distribution of the two cases from the sentence-level attention.

Appendix B Hyper-Parameters for Self-Training

We implemented BERT-HA with BERT-base from a commonly used library (https://github.com/huggingface/transformers), and directly used the original source code of DSQA (https://github.com/thunlp/OpenQA) (Lin et al., 2018). All code and datasets will be released after the review period. The hyper-parameters used in BERT-HA and BERT-HA+STM are shown in Table 8.

(Case 1)
…(3)”Why don’t you tackle Indian River, Daylight?” (4)Harper advised, at parting. (5)”There’s whole slathers of creeks and draws draining in up there, and somewhere gold just crying to be found. (6)That’s my hunch. (7)There’s a big strike coming, and Indian River ain’t going to be a million miles away. (8)”And the place is swarming with moose,” Joe Ladue added. (9)”Bob Henderson’s up there somewhere, been there three years now, swearing something big is going to happen, living off’n straight moose and prospecting around like a crazy man.” (10)Daylight decided to go Indian River a flutter, as he expressed it; but Elijah could not be persuaded into accompanying him. Elijah’s soul had been seared by famine, and he was obsessed by fear of repeating the experience. (11)”I jest can’t bear to separate from grub,” he explained. (12)”I know it’s downright foolishness, but I jest can’t help it…”
Question: Are there many bodies of water there?
Answer: No
(Case 2)
(1)If you live in the United States, you can’t have a full-time job until you are 16 years old. (2)At 14 or 15, you work part-time after school or on weekends, and during summer vacation you can work 40 hours each week. (3)Does all that mean that if you are younger than 14, you can’t make your own money? (4)Of course not! (5)Kids from 10-13 years of age can make money by doing lots of things. (6)Valerie, 11, told us that she made money by cleaning up other people’s yards. …(11)Kids can learn lots of things from making money. (12)By working to make your own money, you are learning the skills you will need in life. (13)These skills can include things like how to get along with others, how to use technology and how to use your time wisely. (14)Some people think that asking for money is a lot easier than making it; however, if you can make your own money, you don’t have to depend on anyone else…
Question: Can they learn time management?
Answer: No
Table 7: Examples from the development set of CoQA. Evidential sentences in red in reference passages are crucial to answer the questions. Sentences in blue are distracting as Figure 4 shows.
max length     380        380        512   512   512   480   512
learning rate  5e-5/4e-5  5e-5/4e-5  2e-5  2e-5  2e-5  2e-5  3e-5
epoch          3          3          5     8     3     2/3   4
λ              0.8        0.8        0.8   0.8   0.8   0.8   0.8
batch size     32         32         32    32    6     8     6
τ1             0.5        0.5        0.5   0.5   0.6   0.5   0.5
τ2             0.9        0.9        0.8   0.8   0.9   0.9   0.7
N              40000      10000      3000  2000  1500  1000  500
iterations     2          3          4     3     1     1     1
Table 8: Hyper-parameters across the seven datasets. Values marked with / are used in BERT-HA/BERT-HA+STM, respectively. Other unmarked hyper-parameters are shared by these two models.