Frustratingly Poor Performance of Reading Comprehension Models on Non-adversarial Examples

04/04/2019 ∙ by Soham Parikh, et al. ∙ Indian Institute of Technology, Madras

When humans learn to perform a difficult task (say, reading comprehension (RC) over longer passages), it is typically the case that their performance improves significantly on an easier version of this task (say, RC over shorter passages). Ideally, we would want an intelligent agent to also exhibit such behavior. However, on experimenting with state-of-the-art RC models using the standard RACE dataset, we observe that this is not true. Specifically, we see counter-intuitive results wherein even when we show frustratingly easy examples to the model at test time, there is hardly any improvement in its performance. We refer to this as non-adversarial evaluation, as opposed to adversarial evaluation. Such non-adversarial examples allow us to assess the utility of specialized neural components. For example, we show that even for easy examples where the answer is clearly embedded in the passage, the neural components designed for paying attention to relevant portions of the passage fail to serve their intended purpose. We believe that the non-adversarial dataset created as a part of this work would complement the research on adversarial evaluation and give a more realistic assessment of the ability of RC models. All the datasets and code developed as a part of this work will be made publicly available.







Related Work

This section is organized into two parts: (i) the first introduces the datasets and models for RC QA, and (ii) the second introduces adversarial evaluation.

QA Datasets & Models: Over the past few years, several large-scale datasets have been proposed for RC, inspiring increasingly complex models with many components. These datasets are of varied flavors and differ in whether the answer should be generated/selected/extracted. Cloze-style datasets like CNN and Daily Mail [Hermann et al., 2015], the Children’s Book Test [Hill et al., 2015] and Who Did What [Onishi et al., 2016] contain the answer inside the passage as an entity/verb/adjective, etc. SQuAD [Rajpurkar et al., 2016], TriviaQA [Joshi et al., 2017] and NewsQA [Trischler et al., 2016] require the RC model to predict a contiguous span of text inside the passage as the answer. MS MARCO [Nguyen et al., 2016] contains human-generated answers, which in turn requires the models to perform answer generation. Other datasets like MCTest [Richardson, Burges, and Renshaw, 2013], the NTCIR QA Lab Task [Shibuki et al., 2014] and RACE [Lai et al., 2017] contain multiple-choice questions (MCQ) where the task is to select the correct option from a given set of options.

Most of the recent models proposed for these datasets have specialized attention modules like (i) Query-aware context representation [Xiong, Zhong, and Socher, 2016, Seo et al., 2016, Chen, Bolton, and Manning, 2016, Cui et al., 2017, Dhingra et al., 2017] and (ii) Self-aware context representation [Wang et al., 2017, Hu, Peng, and Qiu, 2017]. These attention-based modules aim at focusing on important passage words based on the information (i.e., query or passage words) provided. In this work, we argue that it is important to assess the utility of these modules using non-adversarial examples.

Adversarial Evaluation: This line of work focuses on bringing out the poor generalization abilities of Deep Neural Networks by feeding them carefully constructed adversarial examples. [Goodfellow, Shlens, and Szegedy, 2014] showed that by adding a small amount of carefully designed noise to an image, it is possible to fool image classification models into predicting a wrong label for the given image (even though there is no visible difference in the image after perturbation).

Adding noise to natural language text while preserving the semantics is tricky, but some attempts have been made in this direction. [Li, Monroe, and Jurafsky, 2016] erase words from a discourse while trying to maintain the meaning of the text. [Zhao, Dua, and Singh, 2018] use auto-encoders to map discrete text to continuous embeddings and then add perturbations in the continuous space. [Jia and Liang, 2017] use the SQuAD dataset to show that if adversarial sentences which have a high word overlap with the answer sentence are inserted into the passage without affecting the answer, then the models get confused and their performance drops significantly. This exposes the vulnerability of the models in adversarial settings. In this paper, we explore the other end of the spectrum and show that even on non-adversarial examples which are significantly easier for humans to answer, there is hardly any improvement in the performance of existing RC models.

Choice of Dataset

As mentioned earlier, in this work we focus on the RACE dataset. RACE is a large-scale MCQ-style dataset, which gives us more scope for creating non-adversarial examples by suitably simplifying the (i) passage, (ii) query and/or (iii) options. More importantly, this is a hard dataset with a significant fraction of questions requiring reasoning and inference (refer to [Lai et al., 2017] for statistics), where even state-of-the-art models have been able to reach only 45% accuracy while the human performance is about %. There is thus enough scope for the models to give an improved performance on non-adversarial examples over the actual test examples from RACE. In contrast, for some of the other large-scale datasets such as SQuAD and CNN/Daily Mail, state-of-the-art models have already reached near-human performance and hence there isn't enough scope for improvement in their performance on non-adversarial examples. There is a clear scope for designing such strategies for other flavors of QA such as (i) TriviaQA [Joshi et al., 2017], which requires evidence from multiple passages, and (ii) MS MARCO [Nguyen et al., 2016], which requires the generation of answers. Our work is an important first step in this direction and we hope that it will fuel interest in the development of such strategies for these other flavors of QA as well.

Non-Adversarial Examples

In this section, we describe different ways of creating non-adversarial examples.

Modifying the passage

We propose different ways of modifying the passage to provide explicit or implicit hints about the answer.

P1 - Append answer to the passage: The simplest thing is to append the correct answer at the end of the passage. This is very naive and there is no reason why a model (or even a human for that matter) should be able to answer the query as the answer is placed out of context. However, we list it here for the sake of completeness.

P2 - Append query & answer to the passage: If both the query and the answer are appended at the end of the passage then most humans would be easily able to answer the query without having to read this passage. This just requires very basic comprehension skill and we should expect a trained QA model to be able to find the answer. We can also think of the query as providing context for the answer.

P3 - Append query, answer as a declarative sentence: Building on the above intuition, to simplify things even further, we combine the query and the answer to form a sentence and append this sentence at the end of the passage. A majority of the queries in RACE are fill-in-the-blank style queries. It is straightforward to convert these query-answer pairs into a declarative sentence by simply replacing the blank with the answer. For other types of queries, we create declarative sentences (refer to Figure 2) by using manually defined rules over CoreNLP constituency parses (we use the publicly available code provided by [Jia and Liang, 2017]).
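For the fill-in-the-blank case, the conversion reduces to a string substitution. The sketch below is our own illustration (not the authors' released code) and assumes the blank is rendered as a run of underscores; other query types need the rule-based path described above.

```python
import re

def fill_in_blank_to_declarative(query, answer):
    """Turn a fill-in-the-blank query and its answer into a declarative
    sentence by substituting the answer for the blank (assumed here to be
    a run of underscores)."""
    # A lambda replacement avoids re.sub interpreting escapes in `answer`.
    sentence = re.sub(r"_+", lambda m: answer, query, count=1).strip()
    if not sentence.endswith("."):
        sentence += "."
    return sentence
```

For example, `fill_in_blank_to_declarative("The sky is _.", "blue")` yields the sentence that P3 appends to the passage.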

P4 - Simplify the passage: As an alternative way to reduce the difficulty of the passages, we pass them through an automatic text simplification model [Nisioi et al., 2017] and use the simplified passages at test time.

P5 - Retain only relevant sentences: To remove extraneous information, we only retain those sentences of the passage which are needed to answer the query. In other words, the new passage is a query-specific summary of the original passage. Specifically, we collect these examples with the help of in-house annotators (proficient in English) who are shown the passage, query, and answer and are asked to retain only sentences from the passage which are required to answer the query.

P6 - Replace passage by query & answer: This is an even more simplified version of P2 wherein we replace the entire passage by the concatenation of query and answer. Typically, a simple bag-of-words model should also give a high performance on this dataset.

P7 - Replace passage by query & answer as a declarative sentence: Analogous to P6, this dataset is a simplified version of P3 where the passage is replaced entirely by the declarative sentence formed by combining the query and the answer. Passages in P6 and P7 can again be considered to be variants of a query-specific summary of the passage and are embarrassingly simple and to the point. Most humans would be able to answer the queries with % accuracy.

P8 - Replace passage with the answer: The entire passage is replaced by the correct option. While the resulting passage hardly makes any syntactic and semantic sense, we include this for the sake of completeness. However, it is worth mentioning that if a human is given a triplet containing {answer, query, options} instead of {passage, query, options}, in the absence of any other information, the human will simply pick the answer (that is the most elementary thing to do), obtaining % accuracy. Again, a simple bag-of-words model would also be able to predict answers correctly with high accuracy.

P9 - Place explicit hints to the answer: We explicitly add the following text at the beginning of the passage “The answer to $QUERY$ is at the end of the passage”. Simultaneously, we append the following text at the end of the passage, “The answer to $QUERY$ is $ANSWER$”. $QUERY$ and $ANSWER$ are variables which are replaced by the actual query and the answer. Such hints make the task embarrassingly easy for humans, who would simply read the hint in the first sentence, skip reading the passage and pick the answer from the hint placed at the end.
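The purely template-based passage modifications (P2, P6, P9) are simple string operations; a minimal sketch, with function names and signatures of our own choosing:

```python
def make_p2(passage, query, answer):
    """P2: append the query and the answer at the end of the passage."""
    return f"{passage} {query} {answer}"

def make_p6(passage, query, answer):
    """P6: replace the entire passage by the query and the answer
    (the passage argument is kept only for a uniform signature)."""
    return f"{query} {answer}"

def make_p9(passage, query, answer):
    """P9: explicit hints at both ends of the passage."""
    prefix = f"The answer to {query} is at the end of the passage."
    suffix = f"The answer to {query} is {answer}."
    return f"{prefix} {passage} {suffix}"
```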

Passage: Hidden in a small street in the south end of Springfield … Frigo’s is an Italian restaurant … I stepped into Frigo’s almost by accident when … I have a feeling that I’ll be picking up dinner for me and the kids at Frigo’s soon. (272 words)
Query: How much did the writer pay for his first meal at Frigo’s?
Summarized Passage: I stepped into Frigo’s almost by accident when I had to stay in Springfield into the evening for an open house at the school where I work. I ordered the easiest meal possible: a chicken sandwich and a salad. It cost $4.75 for the sandwich. The salad was $4.99 and didn’t have salad dressing on it. (56 words)

Figure 1: Example from the P5 dataset

Query: What does the sentence “You’re quite a fellow to build this bridge!” mean?
Answer: John was great to build this bridge.
Declarative Sentence: The sentence “You’re quite a fellow to build this bridge!” means John was great to build this bridge.

Figure 2: Example of declarative sentence created from query and answer

Modifying the query

We take inspiration from school/college textbooks where tough queries are often appended with hints. On similar lines, we propose two simple modifications to the query.

Q1 - Append answer as a hint to the query: We append the text "Answer is $ANSWER$" at the end of the query. Such a hint just gives away the answer and it is trivial for any human to answer such a query.

Q2 - Append not-an-answer hint to the query: Since the hint in Q1 is too direct, in Q2 we indicate the wrong options by appending the text "Answer is not $OPTION1$, $OPTION2$, $OPTION3$" to the query, assuming without loss of generality that the remaining option is the correct one. Again, it would be very easy for a human to answer the query in the presence of such a hint.
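Q1 and Q2 are likewise one-line templates over the query; a sketch (our own illustrative code, assuming `options` holds the four option strings):

```python
def q1_hint(query, answer):
    """Q1: give the answer away directly in the query."""
    return f"{query} Answer is {answer}"

def q2_hint(query, options, answer):
    """Q2: list the three incorrect options as a not-an-answer hint."""
    wrong = [o for o in options if o != answer]
    return f"{query} Answer is not {wrong[0]}, {wrong[1]}, {wrong[2]}"
```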

Modifying the options

We propose the following ways of simplifying the options to make things easier for the model:

O1 - Replace each option by query & option as a declarative sentence: We do not claim that this makes things dramatically easy for the model but the idea here is that the declarative sentence should help the model to read the query and options together in a better light.

O2 - Replace wrong options with options from other example(s): While the options are easier to read and comprehend in O1, the model still has to distinguish between confusing options. To simplify things, we replace the incorrect options in every example with randomly selected options from other examples. The idea is that since the wrong options are not relevant to the query (or, in most cases, the passage), the reader should be able to assign low probability scores to these. Again, most humans would find this setup much easier and would be able to pick the right answer by elimination. We validate this by conducting human evaluations where the evaluators are only shown the query and options. They are able to reach a performance of on a test set of such examples without even reading the passage.

O3 - Reducing the number of options: With four options, the chance of randomly guessing the answer is 25%. As we reduce the number of options, the chance of randomly guessing improves (33.3% with 3 options and 50% with 2 options). It would be interesting to see if there is a dramatic relative improvement in the performance of the model as compared to the random baseline when fewer options are provided. For example, with 4 options, if the model gets % relative improvement w.r.t. the random baseline of 25%, then it would be interesting to see whether with reduced (2) options this relative improvement over the random baseline of 50% is greater than %.
Moreover, instead of randomly dropping an incorrect option, we also create a test set where we ask in-house human annotators to select the most confusing incorrect option for a given tuple of {passage, query, options}. This is done to check whether the performance of models is better in this case as opposed to randomly dropping an option.
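The option modifications O2 and O3 can be sketched as follows (our own illustrative code; `example` is assumed to be a dict with `options` and `answer` fields, and `option_pool` holds options collected from other examples):

```python
import random

def o2_replace_wrong_options(example, option_pool, rng=random):
    """O2: keep the correct option, replace each incorrect option with a
    randomly chosen option from other examples (`option_pool`)."""
    new_options = [opt if opt == example["answer"] else rng.choice(option_pool)
                   for opt in example["options"]]
    return {**example, "options": new_options}

def o3_drop_option(example, rng=random):
    """O3: randomly drop one incorrect option, leaving three choices."""
    wrong = [o for o in example["options"] if o != example["answer"]]
    dropped = rng.choice(wrong)
    return {**example,
            "options": [o for o in example["options"] if o != dropped]}
```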

Models employed

In this section, we describe the various state-of-the-art models that we evaluate on the above non-adversarial examples. These models were originally proposed for the SQuAD dataset, wherein the task is to predict the correct span in the passage, and they do not have components for encoding or selecting the options. Two of these, viz., Gated Attention Reader (GAR) [Dhingra et al., 2017] and Stanford Attention Reader (SAR) [Chen, Bolton, and Manning, 2016], have already been adapted for the RACE dataset by suitably modifying them to encode the options and select the right option instead of predicting a span. We refer the reader to the original RACE paper [Lai et al., 2017] for these modifications. In a similar vein, we suitably modify 3 other models and adapt them to the RACE dataset as described below.

Dynamic Co-attention Network (DCN)

This model [Xiong, Zhong, and Socher, 2016] consists of three modules: (i) a document and query encoder, (ii) a co-attention module and (iii) a dynamic pointing decoder. First, we use a separate LSTM to encode the options and use the state of the LSTM at the last time-step as the vector representation of the option. The co-attention module, which pays attention over passage and query words simultaneously, is used without any modification for the RACE dataset. Lastly, we need to replace the dynamic pointing decoder, which is used to predict the start and end locations of the answer span. We use the same modifications that [Lai et al., 2017] proposed to adapt GAR and SAR for the RACE dataset. Specifically, we replace the output module with a simple bilinear attention layer which computes the bilinear similarity between document word representations and the query representation (i.e., the final hidden state of the LSTM used for encoding the query). We then normalize these word-query similarities using a softmax function to compute the attention weight for each passage word in light of the query. Using these weights, we compute the weighted sum of passage word representations as the passage representation, which in turn is used to compute the bilinear similarity with the representation of each option. The option with the highest similarity score is predicted as the answer.
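The output layer just described can be sketched in a few lines of plain Python. This is our own illustration, not the trained models' code: for brevity it reuses a single bilinear matrix W for both the word-query and option-passage similarities, whereas the actual models learn separate parameters.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def bilinear_output_layer(passage, query, options, W):
    """Pick an option via bilinear attention:
    (i) attention weight of each passage word p_i = softmax over p_i^T W q,
    (ii) passage representation = attention-weighted sum of word vectors,
    (iii) each option o scored by o^T W p_rep; the highest score wins.
    passage: list of d-dim word vectors; query and each option: d-dim
    vectors; W: d x d bilinear matrix (shared here for brevity)."""
    Wq = [dot(row, query) for row in W]              # W @ q
    attn = softmax([dot(p, Wq) for p in passage])    # p_i^T W q, normalized
    d = len(query)
    p_rep = [sum(a * p[k] for a, p in zip(attn, passage)) for k in range(d)]
    Wp = [dot(row, p_rep) for row in W]              # W @ p_rep
    option_scores = [dot(o, Wp) for o in options]
    return option_scores.index(max(option_scores))
```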

Bi-Directional Attention Flow (BiDAF)

BiDAF [Seo et al., 2016] is a complex hierarchical multi-stage model containing 6 layers. We make some simplifications to this model for ease of experimentation and some modifications to adapt it to the RACE dataset. We do not use a character embedding layer and use a simple word embedding layer as opposed to the two-layer highway network in the original paper. For the contextual embedding layer described in the original paper, we use LSTMs for computing the representations of the passage and the query, and add another LSTM for computing the representations of the options. We retain the attention flow layer and the modeling layer as they are. To obtain a fixed-length passage representation, a weighted sum of passage word representations is computed (with attention weights computed as in Equation 3 of [Seo et al., 2016]). For predicting the answer, we follow the method from the DCN description.

Mnemonic Reader (MNR)

as proposed in [Hu, Peng, and Qiu, 2017] does iterative alignment in two steps: (i) interactive alignment between the query and the document to generate a query-aware passage representation (i.e., a representation which focuses on important parts of the passage in light of the query), and (ii) self-alignment of the query-aware passage representation with itself to make the representation self-aware (i.e., to fuse information across the passage to capture long-range dependencies between words). Here again, we mainly change the output layer to compute attention weights (refer to Equation 15 in [Hu, Peng, and Qiu, 2017]) for a weighted sum of passage words. For predicting the answer, we follow the method from the BiDAF and DCN descriptions.

Dataset GAR SAR DCN BiDAF MNR
RACE 44.08 43.3 41.75 41.75 41.97
P1 44.77 44.93 41.97 42.4 42.3
P2 44.69 45.26 42.1 42.36 42.2
P3 44.63 44.91 42.06 42.38 42.22
P4 36.56 43.04 32.18 41.73 32.30
P6 46.84 55.65 46.39 46.92 47.22
P7 47.22 55.55 46.66 46.05 47.83
P8 49.53 65.99 51.66 50.2 53.91
P9 44.45 45.12 41.97 42.56 42.14
Q1 42.46 42.54 41.61 40.64 42.62
Q2 34.21 37.19 36.68 34.35 27.79
O1 39.04 39.87 39.93 39.14 39.28
O2 43.6 43.17 41 41.02 37.86
O3(3) 52.03 51.64 50.06 50.43 50.81
O3(2) 65.89 66.21 65.71 65.14 64.98
Table 1: Results of the 5 models when trained on original RACE dataset and tested on the non-adversarial versions of it. O3(3) and O3(2) correspond to having 3 and 2 options respectively.


We train each of the 5 models described above on the training set of the RACE dataset and tune their hyperparameters on its validation set to give the best performance. For benchmarking, we report the performance of these models on the test set of the RACE dataset. We then create non-adversarial test sets from the test set of the RACE dataset (P1-P4, P6-P9, Q1-Q2 and O1-O3). The hypothesis is that if a model has truly learned Natural Language Understanding (NLU) then its performance should be much better on these non-adversarial datasets than on the original RACE test set (just as we expect most humans to excel on these non-adversarial test examples). The results of our experiments are summarized in Table 1. We make a few observations from these results:

On P2 and P3, where the answer is present in the document along with the relevant context, we hardly see any improvement in performance as compared to that on the original RACE dataset.
On P6 and P7, where only the answer along with its relevant context is present, the improvements of 4 out of the 5 models are marginal. While the improvement is significant for the SAR model, its performance is still only around 55%.
SAR, which is the most simplistic model, gives the best performance on a majority of the datasets. This, in turn, indicates that the more complex models have overfit to the original RACE dataset and are poor at generalizing to non-adversarial examples.
For the dataset O3 with 3 and 2 options, the relative improvements of the best performing models over random guessing are % and % respectively, while the relative improvement over random guessing on the RACE dataset (with 4 options) is %. This suggests that removing possibly confusing options does not simplify things for the models to a great extent.

Dataset GAR SAR DCN BiDAF MNR
RACE 41.6 39 39.2 40 39.2
P5 41.88 41.08 41.28 39.68 43.09

Table 2: Results of the 5 models when trained on the original RACE dataset and tested on a subset of the RACE test set ( examples) and on the corresponding P5 set created from this subset

We also evaluate the models on the human-annotated datasets, i.e., P5, where annotators select only the sentences required to answer the query, and O3(H), where the most confusing option as indicated by annotators is dropped. Since this test set is created from a subset ( examples) of the original RACE test set, we compare the performance of the models on P5 to their performance on the corresponding original subset in Table 2. For O3(H), in Table 3, we compare the performance with a test set consisting of the same examples where an incorrect option is randomly dropped from each example.

Dataset GAR SAR DCN BiDAF MNR
O3(3) 49.1 51.3 50.3 47.29 50.5
O3(H) 47.49 48.1 47.09 49.5 49.9

Table 3: Results of the 5 models when trained on the original RACE dataset and tested on O3(3) and O3(H). O3(3) corresponds to having 3 options with one randomly dropped, while O3(H) corresponds to dropping the most confusing option as judged by human annotators

Note that the RACE dataset also has a natural easy-hard split because it contains questions from middle school and high school exams (middle school presumably being easier). So we conduct another experiment where we train the models on high school examples and evaluate them on both middle and high school examples at test time. Here again, from Table 4, we observe that the performance of the models on the middle school test set is in fact lower than their performance on the high school test set.

Dataset GAR SAR DCN BiDAF MNR
RACE-H 41.57 41.51 41.02 40.17 41.14
RACE-M 36.56 39.62 39.9 38.23 39.62
Table 4: Results of the 5 models when trained on the RACE-H dataset and tested on the RACE-H and RACE-M test sets.

Lastly, we also wanted to check what happens to the performance of a model trained on non-adversarial examples. We use non-adversarial versions of the training data to train a separate model for each non-adversarial training set. We then evaluate each trained model on the corresponding non-adversarial test set (i.e., the model trained using P1-type modifications is evaluated on the test set containing P1-type modifications). As seen in Table 5, the performance on the non-adversarial test set now improves drastically (close to 100%) whereas the performance on the RACE test set is close to that given by random guessing, thereby showing that these models are only capable of learning patterns in the training data and do not exhibit any NLU.

Dataset GAR SAR DCN BiDAF MNR
P9 98.66 98.89 98.01 98.46 95.07
RACE 29.07 27.04 27.64 28.25 29.45

Table 5: Results of the 5 models when trained on P9 and tested on the P9 test set and the original RACE test set

Discussions and Analysis

Attention modules are an important component of all the models described in Section Models employed. Query-aware attention modules use query information to select important passage words. Self-matching modules use information from other passage words to select important passage words. These attention weights are then used to compute a passage representation which pays more attention to these words. Most previous works only do a qualitative analysis of the weights learned by these modules using a handful of examples. In this section, we show how specific non-adversarial examples can be used to quantitatively analyze the performance of these components.

Model/Data UAA UAQ MRR
GAR P2 28.64 37.66 0.0089
P6 56.08 43.92 0.289
SAR P2 27.97 27.97 0.007
P6 48.24 51.76 0.324
DCN P2 65.12 55.53 0.09
P6 51.44 48.56 0.388
BiDAF P2 27.22 44.02 0.005
P6 42.93 57.07 0.21
MNR P2 32.45 31.52 0.0124
P6 50.08 49.92 0.32
Table 6: UAA (Uniform Attention to Answer) and UAQ (Uniform Attention to Query) denote the percentage of examples with less than uniform attention weight assigned to the corresponding N-gram. MRR is the Mean Reciprocal Rank of the N-gram based on the total attention weight.

Output Layer Attention

As described in Section Models employed, the output layer of all the models uses information from the query to compute the attention weights (or importance) of all the passage words. We consider the non-adversarial datasets P2 and P6, in which the answer is present verbatim inside the passage. While ideally we would want all the attention weight to be concentrated on these answer words, other parts of the passage may also contain information relevant for answering the query. We consider that, in the absence of any information, the models should simply learn a uniform attention over all the passage words (i.e., the weight assigned to each word should be 1/n, where n is the total number of words in the passage). Now, if a model has truly learned to pay attention to important words, then we expect these attention weights to be distributed in such a way that the answer words get more than uniform attention. If we denote the passage as a sequence of words w_1, ..., w_n, the attention weights assigned by this module as α_1, ..., α_n, and the answer span as w_i, ..., w_j, we expect Σ_{k=i}^{j} α_k > (j - i + 1)/n, i.e., we expect the total attention mass on the answer words to exceed the mass that uniform attention would assign to them. We observe that for over % of the examples in P2 and for over % of the examples in P6, the total attention weight assigned to answer words is less than what would have been assigned using uniform attention over all words. In the case of P6, the length of the passage is very small and hence there is no distraction, as the entire passage is simply the query and the answer. This suggests that these complex attention components do not really learn to pay attention to relevant words.
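This check is straightforward to implement; a sketch over the attention weights of a single example (our own code, with the span convention as an assumption):

```python
def answer_gets_above_uniform_attention(alphas, span):
    """Return True iff the total attention mass on the answer span exceeds
    what a uniform distribution over the n passage words would place on it,
    i.e. sum(alpha_i..alpha_j) > (j - i + 1) / n.
    alphas: attention weights over passage words (summing to 1);
    span: (i, j) inclusive word indices of the answer in the passage."""
    n = len(alphas)
    i, j = span
    return sum(alphas[i:j + 1]) > (j - i + 1) / n
```

The UAA column of Table 6 is then the fraction of test examples for which this check fails.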

We also perform another quantitative analysis wherein we consider all N-grams in the passage which have the same length as the answer. We compute a score for each N-gram as the sum of the attention weights on all the words in the N-gram. We then rank these N-grams based on this score and compute the MRR of the answer N-gram in this ranked list. As shown in Table 6, we observe that the MRR is significantly low for each model on the dataset P2. The MRRs on P6 are higher due to the shorter passage length.
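Per example, the reciprocal rank of the answer N-gram can be computed as below; the MRR reported in Table 6 is then the mean over the test set. This is our own sketch, and ties are broken optimistically (tied N-grams do not push the answer down the ranking).

```python
def answer_ngram_rr(alphas, answer_span):
    """Reciprocal rank of the answer N-gram when all passage N-grams of
    the answer's length are ranked by their total attention weight.
    alphas: attention weights over passage words;
    answer_span: (i, j) inclusive indices of the answer."""
    i, j = answer_span
    length = j - i + 1
    # Score every N-gram of the answer's length by summed attention weight.
    scores = [sum(alphas[s:s + length])
              for s in range(len(alphas) - length + 1)]
    # Rank 1 = highest-scoring N-gram; ties broken optimistically.
    rank = 1 + sum(1 for s in scores if s > scores[i])
    return 1.0 / rank
```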

Query-Aware Document Attention

GAR, DCN, BiDAF and MNR each compute an affinity matrix A ∈ R^{n×m}, where A_{ij} represents the similarity between the passage word at index i and the query word at index j. This module aims to highlight words in the passage which are important for each query word. Again, we use the datasets P2 and P6, where the query is embedded verbatim inside the passage. Following arguments similar to the ones presented in Section Output Layer Attention, it is natural to expect high attention weights on the query N-gram present in the passage. We rank each passage N-gram of length m (the query length) based on the Frobenius norm of the corresponding m × m sub-matrix of A and compute the MRR of the query N-gram. From Table 7, we observe that for P2, this MRR for each model is considerably low even though the N-gram is an exact match with the query. For P6, the MRR is high as expected, since there is no text except for the query and the answer. Interestingly, the MRR for the answer N-gram, computed similarly, is higher than that of the query N-gram for 3 out of the 4 models on the dataset P2. This further raises the question of whether the query-aware attention modules fully serve their purpose.
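The Frobenius-norm ranking can be sketched as follows (our own illustration; `A` is the n × m affinity matrix for one example, and the mean of this reciprocal rank over the test set gives the MRR(Q) of Table 7):

```python
import math

def query_ngram_rr(A, query_len, query_start):
    """Reciprocal rank of the true query N-gram when every passage N-gram
    of the query's length is ranked by the Frobenius norm of its rows in
    the passage-query affinity matrix A (n x m).
    A[i][j]: affinity of passage word i with query word j;
    query_start: index where the query appears verbatim in the passage."""
    m = len(A[0])

    def frob(start):
        # Frobenius norm of the rows covering the candidate N-gram.
        return math.sqrt(sum(A[i][j] ** 2
                             for i in range(start, start + query_len)
                             for j in range(m)))

    scores = [frob(s) for s in range(len(A) - query_len + 1)]
    rank = 1 + sum(1 for s in scores if s > scores[query_start])
    return 1.0 / rank
```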

Model GAR DCN BiDAF MNR
Dataset P2 P6 P2 P6 P2 P6 P2 P6
MRR(Q) 0.041 0.561 0.025 0.57 0.024 0.503 0.08 0.72
MRR(A) 0.06 0.363 0.09 0.492 0.058 0.497 0.048 0.25
Table 7: MRR for query (MRR(Q)) and answer (MRR(A)) N-grams in Query-Aware Passage Attention layer

Self-Matching Attention

MNR computes a self-affinity matrix S ∈ R^{n×n}, where S_{ij} is the mutual affinity between the words at indices i and j in the passage. While this is useful for multi-sentence reasoning, it is difficult to annotate the related sentences in all the passages of the original dataset. Instead, we use the dataset P9, in which the first and the last sentences of each passage are related to a good extent. Recall that in P9, the first sentence contains the query and a hint that the answer is located in the last sentence. The last sentence in turn contains the query as well as the answer. We rank each N-gram of length l in the passage (except for the N-grams in the first sentence) based on the Frobenius norm of the sub-matrix S[1:f, k:k+l-1] (rows corresponding to the first sentence, columns to the candidate N-gram starting at position k), where f and l are the lengths of the first and last sentences respectively, and compute the MRR of the last sentence (as an N-gram of length l). The MRR we get is , which indicates that S indeed gives high attention to sentences with overlapping words and serves its purpose. However, the poor overall performance indicates there is still a lot to be desired in terms of NLU.

Conclusion and Future Work

We propose methods for generating non-adversarial examples for evaluating RC models; to the best of our knowledge, this is the first step in this direction. The failure of existing RC models to perform well and generalize on these examples supports the argument that they do not really exhibit any NLU but simply do pattern matching and overfit to the given data. Using specific non-adversarial examples, we propose methods to quantify the effectiveness of intermediate attention modules. We hope that our work will further encourage (i) the creation of non-adversarial examples for other datasets, (ii) methods for quantitatively analyzing other RC modules like multi-hop and multi-perspective matching, and (iii) the design of RC models with better natural language abilities.