Improving Machine Reading Comprehension with General Reading Strategies

10/31/2018 ∙ by Kai Sun, et al. ∙ Tencent ∙ Cornell University

Reading strategies have been shown to improve comprehension levels, especially for readers lacking adequate prior knowledge. Just as the process of knowledge accumulation is time-consuming for human readers, it is resource-demanding to impart rich general domain knowledge into a language model via pre-training (Radford et al., 2018; Devlin et al., 2018). Inspired by reading strategies identified in cognitive science, and given limited computational resources - just a pre-trained model and a fixed number of training instances - we therefore propose three simple domain-independent strategies aimed at improving non-extractive machine reading comprehension (MRC): (i) BACK AND FORTH READING, which considers both the original and reverse order of an input sequence, (ii) HIGHLIGHTING, which adds a trainable embedding to the text embedding of tokens that are relevant to the question and candidate answers, and (iii) SELF-ASSESSMENT, which generates practice questions and candidate answers directly from the text in an unsupervised manner. By fine-tuning a pre-trained language model (Radford et al., 2018) with our proposed strategies on the largest existing general domain multiple-choice MRC dataset RACE, we obtain a 5.8% absolute increase in accuracy over the previous best result achieved by the same pre-trained model fine-tuned on RACE without the use of strategies. We further fine-tune the resulting model on a target task, leading to new state-of-the-art results on six representative non-extractive MRC datasets from different domains (i.e., ARC, OpenBookQA, MCTest, MultiRC, SemEval-2018 Task 11, and ROCStories). These results indicate the effectiveness of the proposed strategies and the versatility and general applicability of our fine-tuned models that incorporate these strategies.


1 Introduction

Recently we have seen increased interest in machine reading comprehension (MRC) Rajpurkar et al. (2016); Choi et al. (2018); Kočiskỳ et al. (2018); Reddy et al. (2018). In this paper, we mainly focus on non-extractive MRC Richardson et al. (2013); Khashabi et al. (2018); Ostermann et al. (2018); Clark et al. (2018). Given a reference document or corpus, answering questions requires diverse reading skills, and the majority of candidate answers are non-extractive (Section 2.2). Compared to extractive MRC tasks (Section 2.1), the performance of machine readers on these tasks more accurately indicates the comprehension ability of machines in realistic settings Lai et al. (2017).

Similar to the process of knowledge accumulation for human readers, imparting massive amounts of general domain knowledge from external corpora into a high-capacity language model is time-consuming and resource-demanding. For example, it takes about one month to pre-train a 12-layer transformer Liu et al. (2018) using eight GPUs over 7,000 books Radford et al. (2018). A very recent study Devlin et al. (2018) claims to pre-train a 24-layer transformer using TPUs for four days over the same book corpus plus English Wikipedia (roughly one year of training on eight of the most advanced GPUs such as P100s), which is almost non-reproducible considering the tremendous computational resources required.

The utilization of reading strategies has been shown effective in improving comprehension levels of human readers, especially those who lack adequate prior knowledge of the topic of the text Sheorey and Mokhtari (2001); McNamara et al. (2004). From a practical viewpoint, given a limited number of training instances and a pre-trained model, which we can regard as a human reader with fixed prior knowledge, can we also apply domain-independent strategies to improve the reading comprehension levels of machine readers?

Inspired by reading strategies of human readers identified in cognitive science research Sheorey and Mokhtari (2001); Mokhtari and Sheorey (2002); Mokhtari and Reichard (2002), based on an existing pre-trained transformer Radford et al. (2018) (Section 3.1), we propose three corresponding domain-independent strategies as follows.

  • Back and Forth Reading (“I go back and forth in the text to find relationships among ideas in it.”):
    considering both the original and reverse order of an input sequence (Section 3.2)

  • Highlighting (“I highlight information in the text to help me remember it.”):
    adding a trainable embedding to the text embedding of tokens that are relevant to the question and candidate answers (Section 3.3)

  • Self-Assessment (“I ask myself questions I like to have answered in the text, and then I check to see if my guesses about the text are right or wrong.”):
    generating practice questions and their associated span-based candidate answers from the existing reference documents (Section 3.4)

By fine-tuning a pre-trained transformer Radford et al. (2018) with our proposed strategies on the largest general domain multiple-choice machine reading comprehension dataset RACE Lai et al. (2017), which was collected from language exams (Section 4.2), we obtain an accuracy of 65.4% (passing performance), a 5.8% absolute improvement over the previous best result achieved by the same pre-trained transformer fine-tuned on RACE without the use of strategies. We further fine-tune the resulting high-performing model on a target task. Experiments show that our method achieves new state-of-the-art results on six representative non-extractive machine reading comprehension datasets that require a range of skills such as commonsense and multiple-sentence reasoning (i.e., ARC Clark et al. (2018), OpenBookQA Mihaylov et al. (2018), MCTest Richardson et al. (2013), MultiRC Khashabi et al. (2018), SemEval-2018 Task 11 Ostermann et al. (2018), and ROCStories Mostafazadeh et al. (2016)) (Section 4.3). These results indicate the effectiveness of our proposed strategies and the versatility and generality of our fine-tuned models that incorporate these strategies.

2 Task Introduction

We roughly categorize machine reading comprehension tasks into two groups: extractive (Section 2.1) and non-extractive (Section 2.2) based on the expected answer types.

| | RACE | ARC | OpenBookQA | MCTest | MultiRC | SemEval-2018 Task 11 | ROCStories |
| construction method | exams | exams | exams | crowd. | crowd. | crowd. | crowd. |
| sources of documents | general | science | science | stories | 7 domains | narrative text | stories |
| average # of answer options | 4.0 | 4.0 | 4.0 | 4.0 | 5.4 | 2.0 | 2.0 |
| # of documents | 27,933 | 14M* | 1,326 | 660 | 871 | 2,119 | 3,742 |
| # of questions | 97,687 | 7,787 | 5,957 | 2,640 | 9,872 | 13,939 | 3,742 |
| non-extractive answer (%) | 87.0 | 43.3 | 83.8 | 45.3 | 82.1 | 89.9 | 100.0 |
Table 1: Statistics of multiple-choice machine reading comprehension datasets. Some values come from Reddy et al. (2018), Kočiskỳ et al. (2018), and Lai et al. (2017) (crowd.: crowdsourcing; *: regarding each sentence/claim as a document Clark et al. (2018); non-extractive answer: the percentage of correct answer options that are not text snippets from reference documents).

2.1 Extractive MRC

Recently several large-scale extractive machine reading comprehension datasets have been constructed Hermann et al. (2015); Hill et al. (2016); Onishi et al. (2016); Chen and Choi (2016); Mostafazadeh et al. (2016); Bajgar et al. (2016); Nguyen et al. (2016); Joshi et al. (2017); Ma et al. (2018), such as SQuAD Rajpurkar et al. (2016) and NewsQA Trischler et al. (2017). Given a reference document and a question, the expected answer is a short span from the document. Considering the answer type limitations in these datasets, answers in datasets such as MS MARCO Nguyen et al. (2016), SearchQA Dunn et al. (2017), and NarrativeQA Kočiskỳ et al. (2018) are human generated based on given documents Reddy et al. (2018); Choi et al. (2018). Since annotators tend to copy spans as answers directly, the majority of answers are still extractive Reddy et al. (2018). For example, of questions in SQuAD Reddy et al. (2018) and of questions in NarrativeQA expect extractive answers Kočiskỳ et al. (2018). Given informative factoid questions, state-of-the-art attention-based neural models Wang et al. (2018b); Hu et al. (2018) designed for extractive MRC tasks have already achieved very high performance based on the local context.

2.2 Multiple-Choice MRC Datasets

For multiple-choice MRC datasets, given a question and a reference document/corpus, multiple answer options are provided. There is at least one correct answer option. We primarily discuss the non-extractive datasets, in which answer options are not restricted to extractive text spans. Building a multiple-choice dataset by crowdsourcing (e.g., MCTest Richardson et al. (2013), MultiRC Khashabi et al. (2018), and SemEval-2018 Task 11 Ostermann et al. (2018)) involves extensive human involvement in designing questions and answer options. Besides crowdsourcing, datasets such as RACE Lai et al. (2017), ARC Clark et al. (2018), and OpenBookQA Mihaylov et al. (2018) are collected from language or science examinations designed by educational experts Penas et al. (2014); Shibuki et al. (2014); Tseng et al. (2016), which aim to test the comprehension level of human participants. Compared to questions in extractive MRC tasks, besides surface matching, there are various types of complicated questions such as math word problems, summarization, logical reasoning, and sentiment analysis, requiring advanced reading skills such as reasoning over multiple sentences and the use of prior world knowledge. We can also adopt more objective evaluation criteria such as accuracy Clark et al. (2016); Lai et al. (2017). As these kinds of datasets are relatively difficult to construct or collect, most of the existing datasets are small in size, which hinders the utilization of state-of-the-art deep neural models. In this paper, we explore how to make full use of the limited resources to improve machine reading comprehension.

We summarize six popular representative multiple-choice MRC datasets in Table 1. As shown in the table, most of the correct answer options are non-extractive. Except for MultiRC, there is exactly one correct answer option for each question. For ARC and OpenBookQA, a reference corpus is provided instead of a single reference document associated with each question.

Here we give a formal task definition. Let d, q, and O denote the document, the question, and the set of associated answer options, respectively. Given d and q, the task is to select the correct answer option(s) from O.
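To make this format concrete, here is a minimal sketch (the data structure, field names, and toy example are ours for illustration, not from the paper):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MCInstance:
    """One multiple-choice MRC instance: a document d, a question q,
    and answer options O with the index (or indices) of the correct one(s)."""
    document: str          # d: reference document (or retrieved pseudo-document)
    question: str          # q: question text (may be empty, e.g., for ROCStories)
    options: List[str]     # O: candidate answer options
    labels: List[int]      # indices of the correct option(s); usually exactly one

# Hypothetical toy example
example = MCInstance(
    document="Tom planted a seed. Weeks later, a small tree grew in his garden.",
    question="What did Tom most likely do after planting the seed?",
    options=["He watered it regularly.", "He cut down the tree.",
             "He moved to another city.", "He sold his garden."],
    labels=[0],
)
```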

3 Approach

Figure 1: Framework Overview. Strategy 1, 2, and 3 refer to back and forth reading (Section 3.2), highlighting (Section 3.3), and self-assessment (Section 3.4), respectively.

In this section, we first introduce a pre-trained transformer used as our neural reader (Section 3.1) and then elaborate our strategies for enhancements. For convenience, we borrow the names of the three reading strategies mentioned in the introduction for use as the names of our strategies: back and forth reading (Section 3.2), highlighting (Section 3.3), and self-assessment (Section 3.4).

3.1 Framework Overview

We use the OpenAI fine-tuned transformer language model (OFT) Radford et al. (2018) as our neural reader. It adapts a pre-trained multi-layer transformer Vaswani et al. (2017); Liu et al. (2018) language model to a labeled dataset C, where each instance consists of a sequence of input tokens x^1, ..., x^n along with a label y, by maximizing:

    \sum_{(x,y) \in C} \log P(y \mid x^1, \ldots, x^n) \;+\; \lambda \sum_{(x,y) \in C} \log P_{\mathrm{LM}}(x^1, \ldots, x^n)    (1)

where P_LM is the likelihood under the language model, λ is the weight of the language model objective, and P(y | x^1, ..., x^n) is obtained by a linear classification layer over the final transformer block's activation of the language model. For multiple-choice MRC tasks, x^1, ..., x^n come from the concatenation of a start token, document, question, a delimiter token, answer option, and an end token; y indicates whether an answer option is correct. We refer the reader to Radford et al. (2018) for more details.
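As a rough illustration of this objective only (the transformer, tokenization, and batching are abstracted away, and all function and variable names here are ours), the loss corresponding to Eq. (1) can be sketched as a classification term plus a weighted language-modeling term:

```python
import torch
import torch.nn.functional as F

def fine_tuning_loss(clf_logits, labels, lm_logits, input_ids, lm_weight=0.5):
    """Sketch of the combined objective in Eq. (1): a classification loss over
    answer options plus lm_weight (lambda) times an auxiliary LM loss.

    clf_logits: (batch, n_options)       scores from the linear layer on the final
                                         transformer activation of each option sequence
    labels:     (batch,)                 index of the correct option
    lm_logits:  (batch, seq_len, vocab)  next-token prediction logits
    input_ids:  (batch, seq_len)         token ids of the input sequences
    lm_weight:  lambda; its value is a hyperparameter (0.5 here is illustrative)
    """
    # Classification term: -log P(y | x^1..x^n)
    clf_loss = F.cross_entropy(clf_logits, labels)

    # Language-modeling term: predict token t+1 from tokens <= t
    shift_logits = lm_logits[:, :-1, :].contiguous()
    shift_targets = input_ids[:, 1:].contiguous()
    lm_loss = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_targets.view(-1),
    )
    return clf_loss + lm_weight * lm_loss
```

In practice both terms are computed from the same forward pass of the transformer; minimizing this loss is equivalent to maximizing Eq. (1).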

Apart from placing delimiters to separate document, question, and answer option from one another, the original OFT framework pays less attention to the task structure in MRC tasks. Inspired by previous research on human reading strategies, with limited resources and the pre-trained transformer, we propose three strategies to improve machine reading comprehension. We show the whole framework in Figure 1.

3.2 Back and Forth Reading

For simplicity, we represent the input sequence of the original OFT as [d q $ o], where [, $, and ] represent the start token, delimiter token, and end token, respectively. In our framework, we consider both the original order [d q $ o] and its reverse order [o $ q d]. The token order within d, q, and o is still preserved. We train two OFTs that use [d q $ o] and [o $ q d] as the input sequence, respectively, and then ensemble the two models. We will discuss other similar (almost) reverse pairs of input sequences in the experiments (Section 4.4).
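A minimal sketch of this strategy under the notation above (the special-token strings and the `score` method are placeholders, not the authors' code): build both orderings of the same (d, q, o) triple and average the logits of the two separately trained readers.

```python
def build_sequences(doc_tokens, q_tokens, opt_tokens,
                    start="_start_", delim="$", end="_end_"):
    """Return the original input [start d q $ o end] and its reverse
    [start o $ q d end]; token order inside d, q, and o is preserved."""
    original = [start] + doc_tokens + q_tokens + [delim] + opt_tokens + [end]
    reverse = [start] + opt_tokens + [delim] + q_tokens + doc_tokens + [end]
    return original, reverse

def ensemble_score(model_fwd, model_rev, doc_tokens, q_tokens, opt_tokens):
    """Average the option logits of the forward-order and reverse-order readers
    (model_fwd/model_rev are assumed to expose a score(tokens) method)."""
    original, reverse = build_sequences(doc_tokens, q_tokens, opt_tokens)
    return 0.5 * (model_fwd.score(original) + model_rev.score(reverse))
```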

3.3 Highlighting

In the original OFT, the text embedding of a document is independent of the context of questions and answer options. Inspired by the highlights used by human readers, we aim to make the document encoding aware of the context of a given (question q, answer option o) pair. We focus on content words (nouns, verbs, adjectives, adverbs, numerals, and foreign words) in questions and answer options since they appear to provide more useful information Baroni and Zamparelli (2010); Mirza and Bernardi (2013).

Formally, we let T be the set of part-of-speech (POS) tags of content words. We let E_d = (e_{d_1}, e_{d_2}, ..., e_{d_n}) denote the sequence of text embeddings of document d, where d_i represents the i-th token in d and e_{d_i} denotes its text embedding. Given T and a (q, o) pair, we define the highlight embedding h_i for the i-th token in d as:

    h_i = \begin{cases} \ell^{+}, & \text{if the POS tag of } d_i \in T \text{ and } d_i \text{ appears in } q \text{ or } o \\ \ell^{-}, & \text{otherwise} \end{cases}    (2)

where ℓ+ and ℓ− are two trainable vectors of the same dimension as e_{d_i}.

Following the above definition, the sequence of highlight embeddings H = (h_1, h_2, ..., h_n) is of the same length as E_d. We replace E_d in the original OFT with E_d + H when we encode a document. More specifically, we use the concatenation of e_start, E_d + H, E_q, e_delim, E_o, and e_end as the new input of the pre-trained transformer (Section 3.1), where e_start, e_delim, and e_end denote the embedding of the start token, delimiter token, and end token, respectively, and E_q and E_o represent the sequence of text embeddings of q and o, respectively.
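The following sketch illustrates Eq. (2) using NLTK part-of-speech tagging (our own illustrative code; it assumes the NLTK tokenizer and tagger data are installed, and `l_pos`/`l_neg` stand in for the trainable vectors ℓ+ and ℓ−):

```python
import numpy as np
import nltk  # assumes the 'punkt' and 'averaged_perceptron_tagger' data are available

# Content-word POS tags (the tagset T listed in Section 4.1)
CONTENT_POS = {"NN", "NNP", "NNPS", "NNS", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ",
               "JJ", "JJR", "JJS", "RB", "RBR", "RBS", "CD", "FW"}

def highlight_embeddings(doc_tokens, question, option, l_pos, l_neg):
    """Return one highlight vector per document token, following Eq. (2):
    l_pos for content-word tokens that also appear in the question or the
    answer option, l_neg otherwise. l_pos and l_neg are trainable in the model;
    here they are simply numpy vectors of the embedding dimension."""
    qo_vocab = {w.lower()
                for w in nltk.word_tokenize(question) + nltk.word_tokenize(option)}
    tagged = nltk.pos_tag(doc_tokens)
    H = np.stack([
        l_pos if (tag in CONTENT_POS and tok.lower() in qo_vocab) else l_neg
        for tok, tag in tagged
    ])
    return H  # same length as the document; added element-wise to E_d before encoding
```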

| Approach | # of Ensemble Models | RACE-Middle | RACE-High | RACE-All |
| Previous SOTA: | | | | |
| BiAttention MRU Tay et al. (2018) | 9 | 60.2 | 50.3 | 53.3 |
| OFT Radford et al. (2018) | - | 62.9 | 57.4 | 59.0 |
| Baselines (Our Implementations): | | | | |
| OFT | - | 60.9 | 57.8 | 58.7 |
| OFT | 2 | 62.6 | 58.4 | 59.6 |
| OFT | 9 | 63.5 | 59.3 | 60.6 |
| Single Strategy: | | | | |
| Self-Assessment (SA) | - | 63.2 | 59.2 | 60.4 |
| Highlighting (HL) | - | 67.4 | 61.5 | 63.2 |
| Back and Forth Reading (BF) | 2 | 67.3 | 60.7 | 62.6 |
| Strategy Combination: | | | | |
| SA + HL | - | 69.2 | 61.5 | 63.8 |
| SA + HL + BF | 2 | 70.9 | 63.2 | 65.4 |
| SA + HL + BF | 9 | 72.0 | 64.5 | 66.7 |
| Amazon Turker Lai et al. (2017) | - | 85.1 | 69.4 | 73.3 |
Table 2: Accuracy (%) on the test set of RACE.

3.4 Self-Assessment

In previous work Radford et al. (2018), the original pre-trained transformer is directly fine-tuned on an end task. Inspired by the self-assessment reading strategy, we propose a simple method to generate questions and their associated multiple span-based answer options, which cover the content of multiple sentences from a reference document. By first fine-tuning the pre-trained model on these “practice” instances, we aim to make the resulting fine-tuned model more aware of the input structure and better able to integrate information across the multiple sentences required to answer a given question.

Concretely, we randomly generate no more than n_q questions and their associated answer options based on each document from the end task (i.e., RACE in this paper). We describe the steps as follows; a code sketch of the procedure is given after the list.

  • Input: a document from the end task, which still serves as the reference document.

  • Output: a question and four answer options associated with the reference document.

  • Randomly pick no more than n_s sentences from the document, and concatenate these sentences together.

  • Randomly pick no more than n_p non-overlapping spans from the concatenation of sentences, each of which randomly contains no more than n_t word tokens. We remove the selected spans, which together are regarded as the correct answer option, from the concatenated sentences and use the remaining text as the question.

  • Generate three distractors (i.e., wrong answer options) by randomly replacing spans in the correct answer option with randomly picked spans from the reference document.

where n_q, n_s, n_p, and n_t are used to control the number and difficulty level of the generated questions.
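A minimal sketch of this generation procedure (function and parameter names are ours, n_s, n_p, and n_t correspond to max_sentences, max_spans, and max_span_len, and the default values are illustrative rather than the settings used in the paper):

```python
import random

def generate_practice_question(sentences, max_sentences=3, max_spans=2,
                               max_span_len=4, n_distractors=3):
    """Generate one cloze-style practice question and its answer options from a
    reference document given as a list of tokenized sentences (lists of tokens)."""
    # 1. Randomly pick no more than max_sentences sentences and concatenate them.
    k = random.randint(1, min(max_sentences, len(sentences)))
    tokens = [tok for sent in random.sample(sentences, k=k) for tok in sent]

    # 2. Randomly pick no more than max_spans non-overlapping spans, each with at
    #    most max_span_len tokens; the removed spans form the correct answer option.
    n_spans = random.randint(1, max_spans)
    starts = sorted(random.sample(range(len(tokens)), k=min(n_spans, len(tokens))))
    spans, keep, prev_end = [], [True] * len(tokens), 0
    for s in starts:
        if s < prev_end:                       # enforce non-overlap
            continue
        end = min(s + random.randint(1, max_span_len), len(tokens))
        spans.append(tokens[s:end])
        for i in range(s, end):
            keep[i] = False
        prev_end = end

    question = " ".join(t for t, kept in zip(tokens, keep) if kept)
    correct = " ".join(tok for span in spans for tok in span)

    # 3. Build distractors by replacing a span of the correct option with a
    #    randomly picked span from the reference document.
    all_tokens = [tok for sent in sentences for tok in sent]
    distractors = []
    for _ in range(n_distractors):
        fake_spans = [list(sp) for sp in spans]
        j = random.randrange(len(fake_spans))
        s = random.randrange(len(all_tokens))
        fake_spans[j] = all_tokens[s:s + random.randint(1, max_span_len)]
        distractors.append(" ".join(tok for sp in fake_spans for tok in sp))

    options = [correct] + distractors
    random.shuffle(options)
    return question, options, options.index(correct)
```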

Target: ARC
| Approach | Method | Source | Ensemble | Easy | Challenge |
| Previous SOTA | IR Clark et al. (2018) | - | - | 62.6 | 20.3 |
| | ET-RR Ni et al. (2018) | ✓ | - | N/A | 36.6 |
| Our Baseline | OFT | ✓ | - | 57.0 | 38.2 |
| | | ✓ | 2 | 57.1 | 38.4 |
| Our Approach | Strategies | ✓ | - | 66.6 | 40.7 |
| | | ✓ | 2 | 68.9 | 42.3 |

Target: OpenBookQA
| Approach | Method | Source | Ensemble | Accuracy |
| Previous SOTA | Odd-One-Out Solver Mihaylov et al. (2018) | - | - | 50.2 |
| Our Baseline | OFT | ✓ | - | 52.0 |
| | | ✓ | 2 | 52.8 |
| Our Approach | Strategies | ✓ | - | 55.2 |
| | | ✓ | 2 | 55.8 |

Target: MCTest
| Approach | Method | Source | Ensemble | MC160 | MC500 |
| Previous SOTA | Finetuned QACNN Chung et al. (2018) | ✓ | - | 76.4 | 68.7 |
| | | - | - | 73.8 | 72.3 |
| Our Baseline | OFT | ✓ | - | 65.4 | 61.5 |
| | | ✓ | 2 | 65.8 | 61.0 |
| Our Approach | Strategies | ✓ | - | 80.0 | 78.7 |
| | | ✓ | 2 | 81.7 | 82.0 |

Target: MultiRC
| Approach | Method | Source | Ensemble | F1m | F1a | EM |
| Previous SOTA | LR Khashabi et al. (2018) | - | - | 66.5 | 63.2 | 11.8 |
| Our Baseline | OFT | ✓ | - | 69.3 | 67.2 | 15.2 |
| | | ✓ | 2 | 70.3 | 67.7 | 16.5 |
| Our Approach | Strategies | ✓ | - | 71.5 | 69.2 | 22.6 |
| | | ✓ | 2 | 73.1 | 70.5 | 21.8 |

Target: SemEval-2018 Task 11
| Approach | Method | Source | Ensemble | Accuracy |
| Previous SOTA | TriAN Wang (2018) | ✓ | - | 81.9 |
| | | ✓ | 9 | 84.0 |
| | HMA Chen et al. (2018) | - | - | 80.9 |
| | | - | 7 | 84.1 |
| Our Baseline | OFT | ✓ | - | 88.0 |
| | | ✓ | 2 | 88.6 |
| Our Approach | Strategies | ✓ | - | 88.8 |
| | | ✓ | 2 | 89.5 |

Target: ROCStories
| Approach | Method | Source | Ensemble | Accuracy |
| Previous SOTA | OFT Radford et al. (2018) | - | - | 86.5 |
| Our Baseline | OFT | ✓ | - | 87.1 |
| | | ✓ | 2 | 87.5 |
| Our Approach | Strategies | ✓ | - | 88.0 |
| | | ✓ | 2 | 88.3 |

Table 3: Performance (%) on the test sets of ARC, OpenBookQA, MCTest, SemEval-2018, and ROCStories and the development set of MultiRC (F1m: macro-average F1; F1a: micro-average F1; EM: exact match accuracy). Approaches marked by ✓ in the Source column use RACE as the source task, except that ET-RR Ni et al. (2018) uses essential terms Khashabi et al. (2017) and Finetuned QACNN Chung et al. (2018) uses MovieQA Tapaswi et al. (2016).
| Approach | ARC (Easy / Challenge) | OpenBookQA | MCTest (MC160 / MC500) | MultiRC (F1m / F1a / EM) | SemEval | ROCStories | Average |
| OFT | 54.0 / 30.3 | 50.0 | 58.8 / 52.0 | 69.3 / 66.2 / 11.9 | 87.3 | 86.7 | 53.9 |
| OFT (ensemble) | 53.9 / 30.7 | 50.0 | 60.0 / 54.0 | 69.3 / 66.5 / 13.1 | 88.0 | 87.0 | 54.6 |
| Strategies | 61.9 / 35.0 | 54.2 | 67.5 / 64.7 | 68.8 / 67.4 / 16.2 | 87.6 | 87.4 | 59.3 |
| Strategies (ensemble) | 63.1 / 35.4 | 55.0 | 70.8 / 64.8 | 69.7 / 67.9 / 16.9 | 88.1 | 88.1 | 60.3 |
Table 4: Performance (%) on the test sets of ARC, OpenBookQA, MCTest, SemEval-2018 Task 11, and ROCStories and the development set of MultiRC using the target data only (i.e., without the data flows 1 and 2 boxed in Figure 1) (F1m: macro-average F1; F1a: micro-average F1; EM: exact match accuracy).

4 Experiment

4.1 Experiment Settings

For most of the hyperparameters, we follow the work of Radford et al. (2018). We use the same preprocessing procedure and the released pre-trained transformer. We first fine-tune the original pre-trained model on the automatically generated instances (Section 3.4) for one training epoch (data flow 1 boxed in Figure 1). In this stage, the instances are generated from the reference documents Lai et al. (2017) in the training and development sets of RACE, with fixed settings of n_q, n_s, n_p, and n_t. We then adapt the model to the large-scale general domain MRC dataset RACE (data flow 2 boxed in Figure 1). Finally, we adapt the resulting fine-tuned model to one of the aforementioned six out-of-domain datasets for at most a small number of additional epochs (data flow 3 boxed in Figure 1). When we adapt the model to different datasets, we set the language model weight λ to a fixed value and ensemble models by averaging logits after the linear layer. The informative POS tagset T is {NN, NNP, NNPS, NNS, VB, VBD, VBG, VBN, VBP, VBZ, JJ, JJR, JJS, RB, RBR, RBS, CD, FW} (Section 3.3). ℓ+ and ℓ− are initialized randomly.
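For the ensembles mentioned above, averaging logits after the linear layer amounts to the following (a self-contained toy sketch, not the authors' code):

```python
import numpy as np

def ensemble_logits(per_model_logits):
    """Ensemble several fine-tuned readers by averaging the logits produced
    after the final linear layer, then pick the highest-scoring option."""
    avg = np.mean(np.asarray(per_model_logits), axis=0)  # shape: (n_options,)
    return int(np.argmax(avg)), avg

# Toy example: two models scoring the same four answer options
pred, avg = ensemble_logits([[1.2, 0.3, -0.5, 0.1],
                             [0.8, 0.9, -0.2, 0.0]])
# pred == 0; avg == [1.0, 0.6, -0.35, 0.05]
```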

4.2 Evaluation on RACE

In Table 2, we report the accuracy of the top two ranked methods on RACE. As RACE (or RACE-All) consists of two subtasks: RACE-Middle collected from middle school exams and RACE-High collected from high school exams, we also report the accuracy of our methods on both of them.

Our single and ensemble models outperform the previous state-of-the-art (SOTA) by a large margin (63.8% vs. 59.0%; 66.7% vs. 59.0%). A single strategy – self-assessment or highlighting – improves over the single-model OFT baseline (58.7%) by 1.7% and 4.5% in accuracy, respectively. Using the back and forth reading strategy, which involves two models, gives a 3.0% improvement in accuracy over the ensemble of two original OFTs (59.6%). Strategy combination further boosts the performance. By combining self-assessment and highlighting, our single model achieves a 5.1% improvement in accuracy over the single OFT baseline (63.8% vs. 58.7%). We apply all three strategies by ensembling two such single models that read an input sequence in the original and the reverse order, respectively, leading to a 5.8% improvement in accuracy over the ensemble of two original OFTs (65.4% vs. 59.6%).

4.3 Adaptation to Other Non-Extractive Machine Reading Comprehension Tasks

We follow the same philosophy of transferring the knowledge from a high-performing model pre-trained on a large-scale supervised data of a source task to a target task, in which only a relatively small number of training instances are available. In our experiment, we regard RACE as the source task since it contains the largest amount of general domain non-extractive questions (Table 1).

In our experiment, we regard five representative standard machine reading comprehension datasets from multiple domains as the target tasks: ARC, OpenBookQA, MCTest, MultiRC, and SemEval-2018 Task 11. Some modifications are required to apply our method to these tasks given their different structures. In ARC and OpenBookQA, there is no reference document associated with each question. Instead, a reference corpus is provided, which consists of unordered science-related sentences relevant to questions. We therefore use Lucene McCandless et al. (2010) to retrieve the top-ranked sentences from the reference corpus, using the non-stop words in a question and its answer options as the query, and treat the retrieved sentences as the reference document. In the MultiRC dataset, a question can have more than one correct answer option. We therefore use a sigmoid function instead of a softmax at the final layer (Figure 1) and regard the task as a binary (i.e., correct or incorrect) classification problem over each (document, question, answer option) instance. We also adapt our method to a non-conventional multiple-choice MRC dataset, ROCStories, which aims at choosing the correct ending to a four-sentence story from two answer options Mostafazadeh et al. (2016). Since no explicit questions are provided, we leave the question sequence empty.
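For illustration only, a comparable retrieval step can be sketched with BM25 via the `rank_bm25` package instead of Lucene (the stopword list, `top_k` value, and function name are ours, not the paper's settings):

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

# Small illustrative stopword list; a full list would normally be used.
STOPWORDS = {"a", "an", "the", "of", "to", "in", "is", "are", "was", "were",
             "what", "which", "who", "and", "or", "for", "on", "by", "with"}

def build_reference_document(corpus_sentences, question, options, top_k=10):
    """Retrieve the top_k corpus sentences most relevant to the non-stopword
    tokens of the question and its answer options, and join them into a
    pseudo reference document (a BM25 stand-in for the Lucene retrieval)."""
    tokenized_corpus = [s.lower().split() for s in corpus_sentences]
    bm25 = BM25Okapi(tokenized_corpus)
    query = [w for w in (question + " " + " ".join(options)).lower().split()
             if w not in STOPWORDS]
    top_sentences = bm25.get_top_n(query, corpus_sentences, n=top_k)
    return " ".join(top_sentences)
```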

We investigate the effectiveness of our method in different settings. We first fine-tune the original OFT with our strategies on RACE and then fine-tune the resulting model on one of the six target tasks (see Table 3). Since the test set of MultiRC is not publicly available, we report the performance of the model that achieves the highest micro-average F1 (F1a) on the development set. For the other tasks, we select the model that achieves the best accuracy on the development set. When the OFT baseline is fine-tuned on RACE without the use of strategies, it already outperforms the previous state-of-the-art (SOTA) on four out of six datasets (OpenBookQA, MultiRC, SemEval-2018 Task 11, and ROCStories). By using the strategies (Section 3) during the fine-tuning stage on RACE, we further improve the performance of the baseline, leading to new SOTA results on all six datasets. We notice that, even without fine-tuning on the target data (i.e., removing data flow 3 in Figure 1), our method already achieves strong performance on ARC Challenge, MCTest, and MultiRC compared to the previous state-of-the-art.

To further investigate the contribution of the strategies, we compare our approach with the original OFT without using the extra labeled data from RACE (i.e., only keeping data flow 3 in Figure 1). As shown in Table 4, both our single and ensemble models consistently outperform OFT. We obtain a relative improvement of roughly 10% in average accuracy over the baseline across all the datasets (59.3% vs. 53.9% for single models; 60.3% vs. 54.6% for ensembles), with especially significant improvements on ARC, OpenBookQA, and MCTest.

4.4 Further Discussions on Strategies

Back and Forth Reading   We notice that the input order difference between the two ensembled models is likely to yield performance gains. Besides ensembling two models that use the input sequence [d q $ o] and its reverse [o $ q d], respectively, we also investigate other reverse or almost reverse pairs. Ensembling such pairs also gives better results than the ensemble of two original OFTs on the RACE dataset (59.6% in Table 2).

Highlighting   We try two variants of the highlight embeddings (Equation 2 in Section 3.3) that consider the content of questions only or answer options only. Experiments show that using such partial information yields a decrease in accuracy compared to 63.2% (Table 2), which is achieved by considering the content words in both a question and its answer options.

Self-Assessment   We explore alternative approaches to generating questions. For example, we use the Wikipedia articles from SQuAD Rajpurkar et al. (2016) instead of the general domain documents from the end task RACE, and generate the same number of questions following the same steps described in Section 3.4. Experiments show that this method also improves the accuracy over the OFT baseline (58.7% in Table 2).

As self-assessment can to some extent be regarded as a data augmentation method, we also investigate other unsupervised augmentation methods: sentence shuffling Ding and Zhou (2018) and paraphrasing based on back-translation Yu et al. (2018). Our experiments demonstrate that neither of them results in performance improvements.

5 Related Work

5.1 Methods for Multiple-Choice MRC

Here we primarily discuss methods applied to large-scale datasets such as RACE. Researchers employ a variety of methods with attention mechanisms Chen et al. (2016); Dhingra et al. (2017) for improvement, through adding an elimination module Parikh et al. (2018), applying hierarchical attention strategies Zhu et al. (2018); Wang et al. (2018a), using reinforcement learning to determine the choice of multiple attention strategies Xu et al. (2018), or applying a new compositional encoder Tay et al. (2018). However, these methods seldom take the rich external knowledge (other than pre-trained word embeddings) and document structures into consideration. In this paper, we investigate different strategies based on an existing pre-trained transformer Radford et al. (2018) (Section 3.1), which leverages rich linguistic knowledge from an external corpus and achieves state-of-the-art performance on a wide range of natural language tasks.

5.2 Transfer Learning for Question Answering and MRC

Transfer learning techniques have been successfully applied to machine reading comprehension Chung et al. (2018); Golub et al. (2017) and question answering Min et al. (2017); Wiese et al. (2017). Compared to previous work, we simply fine-tune our model on the source data and then further fine-tune the entire model on the target data. The investigation of strategies such as varying the pre-training/fine-tuning data size, adding additional parameters or an L2 loss, combining different datasets for training, and fine-tuning only part of the parameters is beyond the scope of this work.

5.3 Data Augmentation for MRC Without Using External Datasets

Previous methods augment the training data by randomly reordering words or shuffling sentences Ding and Zhou (2018); Li and Zhou (2018). Question generation and paraphrasing methods have also been explored Yang et al. (2017); Yuan et al. (2017) on extractive MRC, requiring a large amount of training data or limited by the number of training instances Yu et al. (2018). In comparison, our problem (i.e., question and answer options) generation method does not rely on any existing questions in the training set, and we focus on generating problems involving the content of multiple sentences in a reference document.

6 Conclusions and Future Work

Inspired by previous research on using reading strategies to improve comprehension levels of human readers, we propose three strategies – back and forth reading, highlighting, and self-assessment – based on a pre-trained transformer, aiming at improving machine reading comprehension using limited resources.

By applying the three strategies, we obtain an accuracy of 65.4% on the RACE dataset, a 5.8% absolute improvement over the same fine-tuned transformer ensemble without the use of strategies. By fine-tuning the pre-trained transformer on RACE with these strategies, the resulting model significantly outperforms the same pre-trained transformer fine-tuned on RACE without the use of strategies, achieving new state-of-the-art results on six representative non-extractive machine reading comprehension datasets from multiple domains that require a diverse range of comprehension skills. These results consistently indicate the effectiveness of our proposed strategies and the general applicability of our fine-tuned model that incorporates these strategies.

In the future, we are interested in combining our strategies with other pre-trained models, generating more challenging problems which require more advanced skills such as summarization and sentiment analysis, and adapting our framework to more natural language processing tasks whose input is significantly different from that of the multiple-choice MRC tasks.

References