Did the Model Understand the Question?

05/14/2018 ∙ Pramod Kaushik Mudrakarta et al. ∙ Google and The University of Chicago

We analyze state-of-the-art deep learning models for three tasks: question answering on (1) images, (2) tables, and (3) passages of text. Using the notion of attribution (word importance), we find that these deep networks often ignore important question terms. Leveraging such behavior, we perturb questions to craft a variety of adversarial examples. Our strongest attacks drop the accuracy of a visual question answering model from 61.1% to 19%, and that of a tabular question answering model from 33.5% to 3.3%. Additionally, we show how attributions can strengthen attacks proposed by Jia and Liang (2017) on paragraph comprehension models. Our results demonstrate that attributions can augment standard measures of accuracy and empower investigation of model performance. When a model is accurate but for the wrong reasons, attributions can surface erroneous logic in the model that indicates inadequacies in the test data.


1 Introduction

Recently, deep learning has been applied to a variety of question answering tasks; for instance, to answer questions about images (e.g., Kazemi and Elqursh (2017)), tabular data (e.g., Neelakantan et al. (2017)), and passages of text (e.g., Yu et al. (2018)). Developers, end-users, and academic reviewers would all like to understand the capabilities of these models.

The standard way of measuring the goodness of a system is to evaluate its error on a test set. High accuracy is indicative of a good model only if the test set is representative of the underlying real-world task. Most tasks have large test and training sets, and it is hard to manually check that they are representative of the real world.

In this paper, we propose techniques to analyze the sensitivity of a deep learning model to question words. We do this by applying attribution (as discussed in section 3), and generating adversarial questions. Here is an illustrative example: recall Visual Question Answering Agrawal et al. (2015) where the task is to answer questions about images. Consider the question “how symmetrical are the white bricks on either side of the building?” (corresponding image in Figure 1). The system that we study gets the answer right (“very”). But, we find (using an attribution approach) that the system relies on only a few of the words like “how” and “bricks”. Indeed, we can construct adversarial questions about the same image that the system gets wrong. For instance, “how spherical are the white bricks on either side of the building?” returns the same answer (“very”). A key premise of our work is that most humans have expertise in question answering. Even if they cannot manually check that a dataset is representative of the real world, they can identify important question words, and anticipate their function in question answering.

1.1 Our Contributions

We follow an analysis workflow to understand three question answering models. There are two steps. First, we apply Integrated Gradients (henceforth, IG) Sundararajan et al. (2017) to attribute the systems’ predictions to words in the questions. We propose visualizations of attributions to make analysis easy. Second, we identify weaknesses (e.g., relying on unimportant words) in the networks’ logic as exposed by the attributions, and leverage them to craft adversarial questions.

A key contribution of this work is an overstability test for question answering networks. Jia and Liang (2017) showed that reading comprehension networks are overly stable to semantics-altering edits to the passage. In this work, we find that such overstability also applies to questions, and that this behavior can be seen in visual and tabular question answering networks as well. We use attributions to define a general-purpose test for measuring the extent of the overstability (sections 4.3 and 5.3). It involves measuring how a network’s accuracy changes as words are systematically dropped from questions.

We emphasize that, in contrast to model-independent adversarial techniques such as that of Jia and Liang (2017), our method exploits the strengths and weaknesses of the model(s) at hand. This allows our attacks to have a high success rate. Additionally, using insights derived from attributions, we were able to improve the attack success rate of Jia and Liang (2017) (section 6.2). Such extensive use of attributions in crafting adversarial examples is novel to the best of our knowledge.

Next, we provide an overview of our results. In each case, we evaluate a pre-trained model on new inputs. We keep the networks’ parameters intact.


Visual QA (section 4):

The task is to answer questions about images. We analyze the deep network of Kazemi and Elqursh (2017). We find that the network ignores many question words, relying largely on the image to produce answers. For instance, we show that the model retains more than 50% of its original accuracy even when every word that is not “color” is deleted from all questions in the validation set. We also show that the model under-relies on important question words (e.g., nouns), and that attaching content-free prefixes (e.g., “in not many words, ”) to questions drops the accuracy from 61.1% to 19%.

QA on tables (section 5):

We analyze a system called Neural Programmer (henceforth, NP) Neelakantan et al. (2017) that answers questions on tabular data. NP determines the answer to a question by selecting a sequence of operations to apply on the accompanying table (akin to an SQL query; details in section 5). We find that these operation selections are more influenced by content-free words (e.g., “in”, “at”, “the”, etc.) in questions than by important words such as nouns or adjectives. Dropping all content-free words substantially reduces the validation accuracy of the network from 33.5% (the single-model accuracy we obtained on training the Neural Programmer network; the accuracy reported in the original paper is 34.1%). Similar to Visual QA, we show that attaching content-free phrases (e.g., “in not a lot of words”) to the question drops the network’s accuracy from 33.5% to 3.3%. We also find that NP often gets the answer right for the wrong reasons. For instance, for the question “which nation earned the most gold medals?”, one of the operations selected by NP simply picks the first row of the table. Its answer is right only because the table happens to be arranged in order of rank. We quantify this weakness by evaluating NP on the set of perturbed tables generated by Pasupat and Liang (2016) and find that its accuracy drops substantially. Finally, we show an extreme form of overstability where the table itself induces a large bias in the network regardless of the question. For instance, we found that on tables about Olympic medal counts, NP was predisposed to selecting certain operators solely based on the table.

Reading comprehension (Section 6):

The task is to answer questions about paragraphs of text. We analyze the network of Yu et al. (2018). Again, we find that the network often ignores words that should be important. Jia and Liang (2017) proposed attacks wherein sentences that ought not to change the network’s answer are added to the paragraph, but sometimes do change it. Our main finding is that these attacks are more likely to succeed when the added sentence includes all the question words that the model found important (for the original paragraph). For instance, we find that attacks are 50% more likely to be successful when the added sentence includes top-attributed nouns in the question. This insight should allow the construction of more successful attacks and better training data sets.

In summary, we find that all networks ignore important parts of questions. One can fix this by either improving training data, or introducing an inductive bias. Our analysis workflow is helpful in both cases. It would also make sense to expose end-users to attribution visualizations. Knowing which words were ignored, or which operations the words were mapped to, can help the user decide whether to trust a system’s response.

2 Related Work

We are motivated by Jia and Liang (2017). As they discuss, “the extent to which [reading comprehension systems] truly understand language remains unclear”. The contrast between their work and ours is instructive. Their main contribution is to fix the evaluation of reading comprehension systems by augmenting the test set with adversarially constructed examples. (As they point out in Section 4.6 of their paper, this does not necessarily fix the model; the model may simply learn to circumvent the specific attack underlying the adversarial examples.) Their method is independent of the specification of the model at hand. They use crowdsourcing to craft passage perturbations intended to fool the network, and then query the network to test their effectiveness.

In contrast, we propose improving the analysis of question answering systems. Our method peeks into the logic of a network to identify high-attribution question terms. Often there are several important question terms (e.g., nouns, adjectives) that receive tiny attribution. We leverage this weakness and perturb questions to craft targeted attacks. While Jia and Liang (2017) focus exclusively on systems for the reading comprehension task, we analyze one system each for three different tasks. Our method also helps improve the efficacy of the attacks of Jia and Liang (2017); see table 4 for examples. Our analysis technique is specific to deep-learning-based systems, whereas theirs is not.

We could use many other methods instead of Integrated Gradients (IG) to attribute a deep network’s prediction to its input features Baehrens et al. (2010); Simonyan et al. (2013); Shrikumar et al. (2016); Binder et al. (2016); Springenberg et al. (2014). One could also use model-agnostic techniques like that of Ribeiro et al. (2016). We choose IG for its ease and efficiency of implementation (it requires just a few gradient calls) and its axiomatic justification (see Sundararajan et al. (2017) for a detailed comparison with other attribution methods).

Recently, there have been a number of techniques for crafting and defending against adversarial attacks on image-based deep learning models. They are based on the oversensitivity of models, i.e., on tiny, imperceptible perturbations of the image that change a model’s response. In contrast, our attacks are based on models’ over-reliance on a few question words even when other words should matter.

We discuss task-specific related work in the corresponding sections (sections 4, 5 and 6).

3 Integrated Gradients (IG)

We employ an attribution technique called Integrated Gradients (IG) Sundararajan et al. (2017) to isolate question words that a deep learning system uses to produce an answer.

Formally, suppose a function $F : \mathbb{R}^n \rightarrow [0, 1]$ represents a deep network, and $x = (x_1, \ldots, x_n) \in \mathbb{R}^n$ is an input. An attribution of the prediction at input $x$ relative to a baseline input $x'$ is a vector $A_F(x, x') = (a_1, \ldots, a_n) \in \mathbb{R}^n$, where $a_i$ is the contribution of $x_i$ to the prediction $F(x)$. One can think of $F$ as the probability of a specific response. The variables $x_1, \ldots, x_n$ are the question words; to be precise, they are vector representations of these terms. The attributions $a_1, \ldots, a_n$ are the influences/blame-assignments of the variables $x_i$ on the probability $F(x)$.

Notice that attributions are defined relative to a special, uninformative input called the baseline $x'$. In this paper, we use an empty question as the baseline, that is, a sequence of word embeddings corresponding to the padding value. Note that the context (image, table, or passage) of the baseline is set to be that of $x$; only the question is set to empty. We now describe how IG produces attributions.

Intuitively, as we interpolate between the baseline and the input, the prediction moves along a trajectory, from uncertainty to certainty (the final probability). At each point on this trajectory, one can use the gradient of the function $F$ with respect to the input to attribute the change in probability back to the input variables. IG simply aggregates the gradients of the probability with respect to the input along this trajectory using a path integral.

Definition 1 (Integrated Gradients)

Given an input $x$ and baseline $x'$, the integrated gradient along the $i$-th dimension is defined as follows:

$$\mathrm{IG}_i(x, x') ::= (x_i - x'_i)\int_{\alpha=0}^{1} \frac{\partial F\big(x' + \alpha\,(x - x')\big)}{\partial x_i}\, d\alpha$$

(here $\partial F/\partial x_i$ is the gradient of $F$ along the $i$-th dimension at $x' + \alpha\,(x - x')$).
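
In practice, the integral is approximated with a summation of gradients at points along the straight line from the baseline to the input. The following is a minimal sketch of this approximation, not the implementation used for the experiments reported here; the model is abstracted as a gradient oracle, and the toy logistic function at the end is purely illustrative.

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Approximate IG_i(x) = (x_i - x'_i) * integral_0^1 dF/dx_i(x' + a(x - x')) da
    with a midpoint Riemann sum over `steps` interpolation points.

    grad_fn: callable mapping an input vector to the gradient of the model's
             output probability w.r.t. that input (assumed to be supplied by the
             surrounding framework, e.g. via automatic differentiation).
    """
    x, baseline = np.asarray(x, dtype=float), np.asarray(baseline, dtype=float)
    alphas = (np.arange(steps) + 0.5) / steps
    total_grad = np.zeros_like(x)
    for a in alphas:
        total_grad += grad_fn(baseline + a * (x - baseline))
    avg_grad = total_grad / steps
    return (x - baseline) * avg_grad  # one attribution per input dimension

# Toy check on F(x) = sigmoid(w . x): attributions sum to F(x) - F(baseline).
w = np.array([1.0, -2.0, 0.5])
F = lambda z: 1.0 / (1.0 + np.exp(-np.dot(w, z)))
grad_F = lambda z: F(z) * (1 - F(z)) * w
x, x0 = np.array([1.0, 1.0, 1.0]), np.zeros(3)
attr = integrated_gradients(grad_F, x, x0, steps=200)
print(attr, attr.sum(), F(x) - F(x0))  # sum of attributions ≈ F(x) - F(x0)
```

The final check illustrates the first property discussed next: the attributions approximately sum to the difference in probability between the input and the baseline.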

Sundararajan et al. (2017) discuss several properties of IG. Here, we informally mention a few desirable ones, referring the reader to Sundararajan et al. (2017) for formal definitions.

IG satisfies the condition that the attributions sum to the difference between the probabilities at the input and the baseline. We call a variable uninfluential if, all else fixed, varying it does not change the output probability. IG satisfies the property that uninfluential variables do not get any attribution. Conversely, influential variables always get some attribution. Attributions for a linear combination of two functions $F_1$ and $F_2$ are the same linear combination of the attributions for $F_1$ and $F_2$. Finally, IG satisfies the condition that symmetric variables get equal attributions.

In this work, we validate the use of IG empirically via question perturbations. We observe that perturbing high-attribution terms changes the networks’ response (sections 5.5 and 4.4). Conversely, perturbing terms that receive a low attribution does not change the network’s response (sections 5.3 and 4.3). We use these observations to craft attacks against the network by perturbing instances where generic words (e.g., “a”, “the”) receive high attribution or contentful words receive low attribution.

4 Visual Question Answering

4.1 Task, model, and data

The Visual Question Answering task Agrawal et al. (2015); Teney et al. (2017); Kazemi and Elqursh (2017); Ben-younes et al. (2017); Zhu et al. (2016) requires a system to answer questions about images (fig. 1). We analyze the deep network from Kazemi and Elqursh (2017). It achieves 61.1% accuracy on the validation set (Fukui et al. (2016) report the state-of-the-art accuracy). We chose this model for its easy reproducibility.

The VQA 1.0 dataset Agrawal et al. (2015) consists of 614,163 questions posed over 204,721 images (3 questions per image). The images were taken from COCO Lin et al. (2014), and the questions and answers were crowdsourced.

The network in Kazemi and Elqursh (2017) treats question answering as a classification task wherein the classes are the 3,000 most frequent answers in the training data. The input question is tokenized, embedded, and fed to a multi-layer LSTM. The states of the LSTM attend to a featurized version of the image, and ultimately produce a probability distribution over the answer classes.
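
For concreteness, the following PyTorch sketch shows the general shape of such a model: token embeddings, an LSTM over the question, attention over spatial image features, and a softmax over the most frequent answers. It is a simplified stand-in for illustration only; the layer sizes and the particular attention mechanism are assumptions, not the architecture of Kazemi and Elqursh (2017).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVQANet(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=512,
                 img_dim=2048, num_answers=3000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.img_proj = nn.Linear(img_dim, hidden_dim)
        self.attn = nn.Linear(hidden_dim, 1)
        self.classifier = nn.Linear(2 * hidden_dim, num_answers)

    def forward(self, question_ids, image_feats):
        # question_ids: (batch, seq_len); image_feats: (batch, regions, img_dim)
        q = self.embed(question_ids)
        _, (h, _) = self.lstm(q)                 # h: (1, batch, hidden)
        q_state = h[-1]                          # final question representation
        img = self.img_proj(image_feats)         # (batch, regions, hidden)
        # attention over image regions, conditioned on the question state
        scores = self.attn(torch.tanh(img + q_state.unsqueeze(1)))
        weights = F.softmax(scores, dim=1)
        attended = (weights * img).sum(dim=1)    # (batch, hidden)
        logits = self.classifier(torch.cat([q_state, attended], dim=-1))
        return F.log_softmax(logits, dim=-1)     # distribution over answer classes
```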

4.2 Observations

We applied IG and attributed the top selected answer class to input question words. The baseline for a given input instance is the image and an empty question. (We do not black out the image in our baseline, as our objective is to study the influence of just the question words for a given image.) We omit instances where the top answer class predicted by the network remains the same even when the question is emptied (i.e., on the baseline input), because IG attributions are not informative when the input and the baseline have the same prediction.
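
Continuing the sketches above, the attribution procedure can be illustrated as follows: interpolate the question embeddings from the all-padding baseline to the actual embeddings while holding the image features fixed, accumulate gradients of the chosen answer's probability, and sum the result over each word's embedding dimensions. This is an illustrative sketch against the toy model above, not the code used for the experiments reported here.

```python
import torch
import torch.nn.functional as F

def question_attributions(model, question_ids, image_feats, answer_idx, steps=50):
    """Integrated Gradients per question word for ToyVQANet, using an
    all-padding ("empty") question as the baseline and keeping the image fixed."""
    model.eval()
    with torch.no_grad():
        x = model.embed(question_ids)                           # (1, seq_len, embed_dim)
        baseline = model.embed(torch.zeros_like(question_ids))  # padding embeddings

    total_grad = torch.zeros_like(x)
    for alpha in (torch.arange(steps, dtype=torch.float32) + 0.5) / steps:
        interp = (baseline + alpha * (x - baseline)).clone().requires_grad_(True)
        # Re-run the model on interpolated embeddings, bypassing the embedding layer.
        _, (h, _) = model.lstm(interp)
        q_state = h[-1]
        img = model.img_proj(image_feats)
        weights = F.softmax(model.attn(torch.tanh(img + q_state.unsqueeze(1))), dim=1)
        attended = (weights * img).sum(dim=1)
        logits = model.classifier(torch.cat([q_state, attended], dim=-1))
        prob = F.softmax(logits, dim=-1)[0, answer_idx]  # probability of the chosen answer
        prob.backward()
        total_grad += interp.grad
    ig = (x - baseline) * (total_grad / steps)           # (1, seq_len, embed_dim)
    return ig.sum(dim=-1).squeeze(0)                     # one scalar attribution per word
```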

Question: how symmetrical are the white bricks on either side of the building Prediction: very Ground truth: very
Figure 1: Visual QA Kazemi and Elqursh (2017): Visualization of attributions (word importances) for a question that the network gets right. Red indicates high attribution, blue negative attribution, and gray near-zero attribution. The colors are determined by attributions normalized w.r.t the maximum magnitude of attributions among the question’s words.

A visualization of the attributions is shown in fig. 1. Notice that very few words have high attribution. We verified that altering the low attribution words in the question does not change the network’s answer. For instance, the following questions still return “very” as the answer: “how spherical are the white bricks on either side of the building”, “how soon are the bricks fading on either side of the building”, “how fast are the bricks speaking on either side of the building”.

On analyzing attributions across examples, we find that most of the highly attributed words are words such as “there”, “what”, “how”, and “doing”; these are usually the less important words in questions. In section 4.3 we describe a test to measure the extent to which the network depends on such words. We also find that informative words in the question (e.g., nouns) often receive very low attribution, indicating a weakness on the part of the network. In Section 4.4, we describe various attacks that exploit this weakness.

4.3 Overstability test

To determine the set of question words that the network finds most important, we isolate words that most frequently occur as top attributed words in questions. We then drop all words except these and compute the accuracy.

Figure 2 shows how the accuracy changes as the size of this isolated set is varied from 0 to 5305. We find that just one word is enough for the model to achieve more than 50% of its final accuracy. That word is “color”.
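
The test itself can be phrased as a short procedure. The sketch below assumes two helpers that stand in for the analyzed system: one returning the top-attributed word of a question, and one evaluating accuracy on a modified question set.

```python
from collections import Counter

def overstability_curve(questions, top_word_fn, eval_fn, sizes, pad="<pad>"):
    """For each vocabulary size k, keep only the k words that most frequently
    appear as a question's top-attributed word, replace every other word with
    padding (one way of "dropping" it), and measure the model's accuracy.

    top_word_fn(question) -> highest-attribution word of the question (assumed helper)
    eval_fn(modified_questions) -> accuracy of the pretrained model (assumed helper)"""
    counts = Counter(top_word_fn(q) for q in questions)
    ranked = [w for w, _ in counts.most_common()]
    curve = {}
    for k in sizes:
        keep = set(ranked[:k])
        modified = [" ".join(w if w in keep else pad for w in q.split())
                    for q in questions]
        curve[k] = eval_fn(modified)
    return curve
```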

Figure 2: VQA network Kazemi and Elqursh (2017): Accuracy as a function of vocabulary size, relative to its original accuracy. Words are chosen in the descending order of how frequently they appear as top attributions. The X-axis is on logscale, except near zero where it is linear.

Note that even when empty questions are passed as input to the network, its accuracy remains at about 44.3% of its original accuracy. This shows that the model is largely reliant on the image for producing the answer.

The accuracy increases (almost) monotonically with the size of the isolated set. The top 6 words in the isolated set are “color”, “many”, “what”, “is”, “there”, and “how”. We suspect that generic words like these are used to determine the type of the answer. The network then uses the type to choose between a few answers it can give for the image.

4.4 Attacks

Attributions reveal that the network relies largely on generic words in answering questions (section 4.3). This is a weakness in the network’s logic. We now describe a few attacks against the network that exploit this weakness.

Subject ablation attack

In this attack, we replace the subject of a question with a specific noun that consistently receives low attribution across questions. We then determine, among the questions that the network originally answered correctly, what percentage result in the same answer after the ablation. We repeat this process for different nouns; specifically, “fits”, “childhood”, “copyrights”, “mornings”, “disorder”, “importance”, “topless”, “critter”, “jumper”, “tweet”, and average the result.

We find that, among the set of questions that the network originally answered correctly, 75.6% of the questions return the same answer despite the subject replacement.
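
A sketch of this attack follows; the subject-span detector and the model's answer function are assumed (hypothetical) helpers, and the noun list is the one given above.

```python
LOW_ATTRIBUTION_NOUNS = ["fits", "childhood", "copyrights", "mornings", "disorder",
                         "importance", "topless", "critter", "jumper", "tweet"]

def subject_ablation_rate(examples, subject_span_fn, answer_fn):
    """Average fraction of originally-correct questions whose answer is unchanged
    when the question subject is replaced by a low-attribution noun.

    examples: (question, image, answer) triples the model already gets right
    subject_span_fn(question) -> (start, end) character span of the subject (assumed helper)
    answer_fn(question, image) -> model answer string (assumed helper)"""
    rates = []
    for noun in LOW_ATTRIBUTION_NOUNS:
        unchanged = 0
        for question, image, answer in examples:
            start, end = subject_span_fn(question)
            ablated = question[:start] + noun + question[end:]
            if answer_fn(ablated, image) == answer:
                unchanged += 1
        rates.append(unchanged / len(examples))
    return sum(rates) / len(rates)
```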

Prefix attack

In this attack, we attach content-free phrases to questions. The phrases are manually crafted using generic words that the network finds important (section 4.3). Table 1 (top half) shows the resulting accuracy for three prefixes: “in not a lot of words”, “what is the answer to”, and “in not many words”. All of these phrases nearly halve the model’s accuracy. The union of the three attacks drops the model’s accuracy from 61.1% to 19%.

We note that the attributions computed for the network were crucial in crafting the prefixes. For instance, we find that other prefixes like “tell me”, “answer this” and “answer this for me” do not drop the accuracy by much; see table 1 (bottom half). The union of these three ineffective prefixes drops the accuracy from 61.1% to only 46.9%. Per the attributions, words present in these prefixes are not deemed important by the network.

Prefix Accuracy
in not a lot of words 35.5%
in not many words 32.5%
what is the answer to 31.7%
Union of all three 19%
Baseline prefix
tell me 51.3%
answer this 55.7%
answer this for me 49.8%
Union of baseline prefixes 46.9%
Table 1: VQA network Kazemi and Elqursh (2017): Accuracy for prefix attacks; original accuracy is 61.1%.
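
One natural reading of the “union” rows in table 1 is that a question counts as correctly answered only if the model is correct under every prefix in the set. A sketch under that reading, with answer_fn as an assumed helper wrapping the pretrained model:

```python
PREFIXES = ["in not a lot of words, ",
            "what is the answer to ",
            "in not many words, "]

def union_prefix_accuracy(examples, answer_fn):
    """Accuracy under the union of prefix attacks: a question counts as correct
    only if the model answers it correctly with every prefix attached.
    examples: (question, image, gold) triples; answer_fn is an assumed helper."""
    correct = 0
    for question, image, gold in examples:
        if all(answer_fn(prefix + question, image) == gold for prefix in PREFIXES):
            correct += 1
    return correct / len(examples)
```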

4.5 Related work

Agrawal et al. (2016) analyze several VQA models. Among other attacks, they test the models on question fragments of telescopically increasing length. They observe that VQA models often arrive at the same answer by looking at a small fragment of the question. Our stability analysis in section 4.3 explains, and intuitively subsumes, this observation; indeed, several of the top attributed words appear in the prefix, while important words like “color” often occur in the middle of the question. Our analysis enables additional attacks, for instance, replacing the question subject with low-attribution nouns. Ribeiro et al. (2016) use a model explanation technique to illustrate overstability for two examples; they do not quantify their analysis at scale. Kafle and Kanan (2017) and Zhang et al. (2016) examine the VQA data, identify deficiencies, and propose data augmentation to reduce over-representation of certain question/answer types. Goyal et al. (2016) propose the VQA 2.0 dataset, which has pairs of similar images that yield different answers to the same question. We note that our method can be used to improve these datasets by identifying inputs where models ignore several words. Huang et al. (2017) evaluate the robustness of VQA models by appending questions with semantically similar questions; our prefix attacks in section 4.4 are in a similar vein, and are perhaps a more natural and targeted approach. Finally, Fong and Vedaldi (2017) use saliency methods to produce image perturbations as adversarial examples; our attacks are on the question.

5 Question Answering over Tables

5.1 Task, model, and data

We now analyze question answering over tables based on the WikiTableQuestions benchmark dataset Pasupat and Liang (2015). The dataset has questions posed over tables scraped from Wikipedia. Answers are either contents of table cells or some table aggregations. Models performing QA on tables translate the question into a structured program (akin to an SQL query) which is then executed on the table to produce the answer. We analyze a model called Neural Programmer (NP) Neelakantan et al. (2017). NP is the state of the art among models that are weakly supervised, i.e., supervised using the final answer instead of the correct structured program. It achieves 33.5% accuracy on the validation set.

NP translates the input into a structured program consisting of four operator and table column selections. An example of such a program applies operators to the “score” column in its first three steps and to the “name” column in its final step, where the output is the name of the person who has the lowest score.

5.2 Observations

We applied IG to attribute operator and column selections to question words. NP preprocesses inputs and, whenever applicable, appends symbols to questions that signify matches between a question and the accompanying table. These symbols are treated the same as question words. NP also computes priors for column selection using question-table matches; these vectors are passed as additional inputs to the neural network. In the baseline for IG, we use an empty question and zero vectors for the column selection priors. (Note that the table is left intact in the baseline.)

Figure 3: Visualization of attributions. Question words, preprocessing tokens and column selection priors on the Y-axis. Along the X-axis are operator and column selections with their baseline counterparts in parentheses. Operators and columns not affecting the final answer, and those which are same as their baseline counterparts, are given zero attribution.

We visualize the attributions using an alignment matrix; such matrices are commonly used in the analysis of translation models (fig. 3). Observe that one particular operator is selected when the question asks for a superlative, and that the word “gold” is a trigger for this operator. We investigate the implications of this behavior in the following sections.

5.3 Overstability test

Similar to the test we did for Visual QA (section 4.3), we check for overstability in NP by looking at accuracy as a function of vocabulary size. We treat the table match annotations and the out-of-vocabulary token as part of the vocabulary. The results are in fig. 4. The curve is similar to that of Visual QA (fig. 2): just 5 tokens (along with the column selection priors) are sufficient for the model to reach more than 50% of its final accuracy on the validation set. These tokens include “many”, “number”, “after”, and “total”.

Figure 4: Accuracy as a function of vocabulary size. Words are chosen in the descending order of how frequently they appear as top attributions to question terms. The X-axis is on logscale, except near zero where it is linear. Note that just 5 words are necessary for the network to reach more than 50% of its final accuracy.

5.4 Table-specific default programs

We saw in the previous section that the model relies on only a few words to produce correct answers. An extreme case of overstability is when the operator sequences produced by the model are independent of the question. We find that if we supply an empty question as input, i.e., the output is a function only of the table, then the distribution over programs is quite skewed. We call these programs table-specific default programs. On average, a large fraction of the selected operators match their table-default counterparts, indicating that the model relies significantly on the table for producing an answer.
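
A sketch of how table-specific default programs and the operator match rate can be computed, assuming a hypothetical predict_program(question, table) helper that wraps the pretrained NP model and returns its (operator, column) selections:

```python
from collections import Counter

def table_default_programs(tables, predict_program, empty_question=""):
    """For every table, run the pretrained model on an empty question; the
    returned program is that table's default program.
    predict_program(question, table) -> tuple of (operator, column) selections
    (assumed helper wrapping the pretrained Neural Programmer)."""
    defaults = {table_id: predict_program(empty_question, table)
                for table_id, table in tables.items()}
    # How skewed is the distribution over default operator sequences?
    op_counts = Counter(tuple(op for op, _ in prog) for prog in defaults.values())
    return defaults, op_counts

def default_match_rate(examples, defaults, predict_program):
    """Average fraction of selected operators that match the table-default
    program, over (question, table_id, table) examples."""
    rates = []
    for question, table_id, table in examples:
        program = predict_program(question, table)
        default = defaults[table_id]
        matches = sum(op == dop for (op, _), (dop, _) in zip(program, default))
        rates.append(matches / len(program))
    return sum(rates) / len(rates)
```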

For each default program, we used IG to attribute operator and column selections to column names, and show the ten most frequently occurring ones across tables in the validation set (table 2).

Here is an insight from this analysis: NP uses a particular two-operator combination to exclude the last row of the table from answer computation. The default program containing this combination has attributions to column names such as “rank”, “gold”, “silver”, “bronze”, “nation”, and “year” (second row of table 2). These column names indicate medal tallies, and such tables usually have a “total” row. If the table happens not to have a “total” row, the model may produce an incorrect answer.

Operator sequence # Triggers Insights
, , , 109 [, date, position, points, name, competition, notes, no, year, venue] sports
, , , 68 [, rank, total, bronze, gold, silver, nation, name, date, no] medal tallies
, , , 29 [name, , notes, year, nationality, rank, location, date, comments, hometown] player rankings
, , , 25 [notes, date, title, , role, genre, year, score, opponent, event] awards
, , , 17 [year, height, , name, position, floors, notes, jan, jun, may] building info.
, , , 14 [opponent, date, result, location, rank, site, attendance, notes, city, listing] politics
, , , 10 [, name, year, edition, birth, death, men, time, women, type] census
Table 2: Attributions to column names for table-specific default programs (programs returned by NP on empty input questions). See supplementary material, table 6 for the full list. These results indicate that the network is predisposed towards picking certain operators solely based on the table.

We now describe attacks that add or drop content-free words from the question, and cause NP to produce the wrong answer. These attacks leverage the attribution analysis.

5.5 Attacks

Question concatenation attacks

In these attacks, we either suffix or prefix content-free phrases to questions. The phrases are crafted using irrelevant trigger words for operator selections (supplementary material, table 5). We manually ensure that the phrases are content-free.

Attack phrase Prefix Suffix
in not a lot of words 20.6% 10.0%
if its all the same 21.8% 18.7%
in not many words 15.6% 11.2%
one way or another 23.5% 20.0%
Union of above attacks 3.3%
Baseline
please answer 32.3% 30.7%
do you know 31.2% 29.5%
Union of baseline prefixes 27.1%

Table 3: Neural Programmer Neelakantan et al. (2017): Validation accuracy when attack phrases are concatenated to the question, either as a prefix or as a suffix. (Original accuracy: 33.5%)

Table 3 describes our results. The first four phrases use irrelevant trigger words and result in a large drop in accuracy. For instance, the first phrase uses “not”, which is a trigger for three different operators, and the second uses “same”, which is a trigger for two operators (supplementary material, table 5). Combined, the four phrases drop the model’s accuracy from 33.5% to 3.3%; the first two phrases alone already cause a substantial drop.

The next set of phrases use words that receive low attribution across questions, and are hence non-triggers for any operator. The resulting drop in accuracy from these phrases is relatively low; combined, they reduce the model’s accuracy from 33.5% to 27.1%.

Stop word deletion attacks

We find that sometimes an operator is selected based on stop words like “a”, “at”, or “the”. For instance, in the question “what ethnicity is at the top?”, one of the operators is triggered on the word “at”. Dropping the word “at” from the question changes the operator selection and causes NP to return the wrong answer.

We drop stop words from questions in the validation dataset that were originally answered correctly and test NP on them. The stop words to be dropped were manually selected and are shown in Figure 5 in the supplementary material. (We avoided standard stop word lists, e.g., NLTK’s, as they contain contentful words, e.g., “after”, which may be important in some questions, e.g., “who ranked right after turkey?”.)

Dropping these stop words causes a substantial drop in accuracy from the original 33.5%. Selecting operators based on stop words is not robust. In real-world search queries, users often phrase questions without stop words, trading grammatical correctness for conciseness. For instance, the user may simply say “top ethnicity”. It may be possible to defend against such examples by generating synthetic training data and re-training the network on it.
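
A sketch of the stop word deletion attack, using an abbreviated subset of the manually selected list from Figure 5 in the supplementary material; answer_fn is an assumed helper wrapping the pretrained model.

```python
# Abbreviated subset of the manually selected stop word list in Figure 5
# (supplementary material).
NP_STOP_WORDS = {"show", "tell", "did", "me", "my", "our", "are", "is", "were",
                 "this", "on", "would", "and", "for", "should", "be", "do",
                 "the", "there", "a", "an", "that", "by", "in", "of", "to",
                 "from", "was", "has"}

def stop_word_deletion_accuracy(examples, answer_fn):
    """Accuracy after deleting the manually selected stop words from questions
    that the model originally answered correctly.
    answer_fn(question, table) -> model answer (assumed helper)."""
    correct = 0
    for question, table, gold in examples:
        stripped = " ".join(w for w in question.split()
                            if w.lower() not in NP_STOP_WORDS)
        if answer_fn(stripped, table) == gold:
            correct += 1
    return correct / len(examples)
```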

Row reordering attacks

We found that NP often gets the question right by leveraging artifacts of the table. For instance, the program for the question “which nation earned the most gold medals” includes an operator that essentially excludes the last row from the answer computation. It gets the answer right for two reasons: (1) the answer is not in the last row, and (2) the rows are sorted by the values in the column “gold”.

In general, a question answering system should not rely on row ordering in tables. To quantify the extent of such biases, we used a perturbed version of the WikiTableQuestions validation dataset as described in Pasupat and Liang (2016) (based on data at https://nlp.stanford.edu/software/sempre/wikitable/dpd/) and evaluated the existing NP model on it (there was no re-training involved). We found that NP’s accuracy on the perturbed tables is far lower than its 33.5% accuracy on the original validation dataset.
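
Row-order sensitivity can also be probed with simple random permutations, as sketched below; this is not the perturbation procedure of Pasupat and Liang (2016), and answer_fn is an assumed helper.

```python
import random

def row_reordering_accuracy(examples, answer_fn, seed=0):
    """Accuracy of the unmodified pretrained model when table rows are randomly
    permuted. Assumes the gold answer is invariant to row order (the perturbations
    of Pasupat and Liang (2016) are constructed to guarantee this; a naive random
    shuffle only approximates it).

    examples: (question, rows, gold) triples, where rows is a list of row records.
    answer_fn(question, rows) -> model answer (assumed helper)."""
    rng = random.Random(seed)
    correct = 0
    for question, rows, gold in examples:
        shuffled = list(rows)      # leave the original table untouched
        rng.shuffle(shuffled)
        if answer_fn(question, shuffled) == gold:
            correct += 1
    return correct / len(examples)
```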

One approach to making the network robust to row-reordering attacks is to train against perturbed tables. This may also help the model generalize better. Indeed, Mudrakarta et al. (2018) note that the state-of-the-art strongly supervised model (i.e., supervised on the structured program) on WikiTableQuestions Krishnamurthy et al. (2017) enjoys a gain in its final accuracy by leveraging perturbed tables during training.

6 Reading Comprehension

Question | AddSent attack that does not work | Attack that works
Who was Count of Melfi | Jeff Dean was the mayor of Bracco. | Jeff Dean was the mayor of Melfi.
What country was Abhisit Vejjajiva prime minister of, despite having been born in Newcastle? | Samak Samak was prime minister of the country of Chicago, despite having been born in Leeds. | Abhisit Vejjajiva was chief minister of the country of Chicago, despite having been born in Leeds.
Where according to gross state product does Victoria rank in Australia? | According to net state product, Adelaide ranks 7 in New Zealand | According to net state product, Adelaide ranked 7 in Australia. (as a prefix)
When did the Methodist Protestant Church split from the Methodist Episcopal Church? | The Presbyterian Catholics split from the Presbyterian Anglican in 1805. | The Methodist Protestant Church split from the Presbyterian Anglican in 1805. (as a prefix)
What period was 2.5 million years ago? | The period of Plasticean era was 2.5 billion years ago. | The period of Plasticean era was 1.5 billion years ago. (as a prefix)
Table 4: AddSent attacks that failed to fool the model. With modifications to preserve nouns with high attributions, these are successful in fooling the model. Question words that receive high attribution are colored red (intensity indicates magnitude).

6.1 Task, model, and data

The reading comprehension task involves identifying a span from a context paragraph as the answer to a question. The SQuAD dataset Rajpurkar et al. (2016) for machine reading comprehension contains 107.7K query-answer pairs, with 87.5K for training, 10.1K for validation, and another 10.1K for testing. Deep learning methods are quite successful on this problem, with the state-of-the-art F1 score of 84.6 achieved by Yu et al. (2018); we analyze their model.

6.2 Analyzing adversarial examples

Recall the adversarial attacks proposed by Jia and Liang (2017) for reading comprehension systems. Their attack AddSent appends sentences to the paragraph that resemble an answer to the question without changing the ground truth. See the second column of table 4 for a few examples.

We investigate the effectiveness of their attacks using attributions. We analyze examples generated by the AddSent method of Jia and Liang (2017), and find that an adversarial sentence succeeds in fooling the model in two cases:

First, when a contentful word in the question gets low/zero attribution and the adversarially added sentence modifies that word. For example, in the question “Who did Kubiak take the place of after Super Bowl XXIV?”, the word “Super” gets low attribution. Adding “After Champ Bowl XXV, Crowton took the place of Jeff Dean” changes the model’s prediction. Second, when a contentful word in the question is not present in the context. For example, in the question “Where hotel did the Panthers stay at?”, the word “hotel” is not present in the context. Adding “The Vikings stayed at Chicago hotel.” changes the model’s prediction.

On the flip side, an adversarial sentence is unsuccessful when a contentful word in the question having high attribution is not present in the added sentence. E.g. for “Where according to gross state product does Victoria rank in Australia?”, “Australia” receives high attribution. Adding “According to net state product, Adelaide ranks 7 in New Zealand.” does not fool the model. However, retaining “Australia” in the adversarial sentence does change the model’s prediction.

6.3 Predicting the effectiveness of attacks

Next, we correlate attributions with the efficacy of the AddSent attacks. We analyzed 1000 (question, attack phrase) instances (data sourced from https://worksheets.codalab.org/worksheets/0xc86d3ebe69a3427d91f9aaa63f7d1e7d/) on which the Yu et al. (2018) model makes the correct prediction before the attack. Of the 1000 cases, 508 fool the model, while 492 do not. We split the examples into two groups: the first group contains examples where a noun or adjective in the question has high attribution but is missing from the adversarial sentence; the rest are in the second group. Our attribution analysis suggests that we should find more failed attacks in the first group, and that is indeed the case: failed attacks are substantially more common in the first group than in the second.
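
The grouping used in this analysis can be computed directly from attributions; the sketch below assumes helpers for per-question attribution scores and part-of-speech tags.

```python
def attack_failure_rates(instances, top_attributed_words_fn, is_noun_or_adj_fn):
    """Split (question, attack_sentence, fooled) instances into two groups:
    group 1 if some highly attributed noun/adjective in the question is missing
    from the attack sentence, group 2 otherwise. Returns the fraction of failed
    attacks (model not fooled) in each group.

    top_attributed_words_fn(question) -> highly attributed question words
    is_noun_or_adj_fn(word) -> bool   (both assumed helpers)"""
    groups = {1: [], 2: []}
    for question, attack_sentence, fooled in instances:
        important = [w for w in top_attributed_words_fn(question)
                     if is_noun_or_adj_fn(w)]
        attack_words = set(attack_sentence.lower().split())
        missing = any(w.lower() not in attack_words for w in important)
        groups[1 if missing else 2].append(fooled)
    return {g: (1.0 - sum(v) / len(v)) if v else None for g, v in groups.items()}
```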

Recall that the attack sentences were constructed by (a) generating a sentence that answers the question, (b) replacing all the adjectives and nouns with antonyms, and named entities by the nearest word in GloVe word vector space Pennington et al. (2014), and (c) crowdsourcing to check that the new sentence is grammatically correct. This suggests a way to use attributions to improve the effectiveness of the attacks: ensure that question words that the model thinks are important are left untouched in step (b), while the other changes in that step are still carried out. In table 4, we show a few examples where the original attack did not fool the model, but a modified attack that preserves a high-attribution noun does.

7 Conclusion

We analyzed three question answering models using an attribution technique. Attributions helped us identify weaknesses of these models more effectively than conventional methods (based on validation sets). We believe that a workflow that uses attributions can aid the developer in iterating on model quality more effectively.

While the attacks in this paper may seem unrealistic, they do expose real weaknesses that affect the usage of a QA product. Under-reliance on important question terms is not safe. We also believe that other QA models may share these weaknesses. Our attribution-based methods can be directly used to gauge the extent of such problems. Additionally, our perturbation attacks (sections 4.4 and 5.5) serve as empirical validation of attributions.

Reproducibility

Code to generate attributions and reproduce our results is freely available at https://github.com/pramodkaushik/acl18_results.

Acknowledgments

We thank the anonymous reviewers and Kevin Gimpel for feedback on our work, and David Dohan for helping with the reading comprehension network. We are grateful to Jiří Šimša for helpful comments on drafts of this paper.

References

Appendix A Supplementary Material

show, tell, did, me, my, our, are, is, were, this, on, would, and, for, should, be, do, I, have, had, the, there, look, give, has, was, we, get, does, a, an, ’s, that, by, based, in, of, bring, with, to, from, whole, being, been, want, wanted, as, can, see, doing, got, sorted, draw, listed, chart, only

Figure 5: Neural Programmer Neelakantan et al. (2017): List of stop words used in the stop word deletion attacks (section 5.5). NP’s accuracy falls substantially when these stop words are deleted from questions in the validation set.
Operator Triggers
[tm_token, many, how, number, or, total, after, before, only]
[before, many, than, previous, above, how, at, most]
[tm_token, first, before, after, who, previous, or, peak]
[many, total, how, number, last, least, the, first, of]
[many, how, number, total, of, difference, between, long, times]
[after, not, many, next, same, tm_token, how, below]
[last, or, after, tm_token, next, the, chart, not]
[most, cm_token, same]
[least, the, not]
[most, largest]
[at, more, least, had, over, number, than, many]
[tm_token]
Table 5: Neural Programmer Neelakantan et al. (2017): Operator triggers. Notice that there are several irrelevant triggers (highlighted in red); for instance, “many” is irrelevant to some of the operators it triggers. See Section 5.5 for attacks exploiting this weakness.
Operator sequence #tables Triggers
, , , 109 [, date, position, points, name, competition, notes, no, year, venue]
, , , 68 [, rank, total, bronze, gold, silver, nation, name, date, no]
, , , 29 [name, , notes, year, nationality, rank, location, date, comments, hometown]
, , , 25 [notes, date, title, , role, genre, year, score, opponent, event]
, , , 17 [year, height, , name, position, floors, notes, jan, jun, may]
, , , 14 [opponent, date, result, location, rank, site, attendance, notes, city, listing]
, , , 10 [, name, year, edition, birth, death, men, time, women, type]
, , , 9 [date, , distance, location, name, year, winner, japanese, duration, member]
, , , 7 [name, notes, intersecting, kilometers, location, athlete, nationality, rank, time, design]
, , , 7 [, ethnicity]
, , , 6 [place, season, , date, division, tier, builder, cylinders, notes, withdrawn]
, , , 5 [report, date, average, chassis, driver, race, builder, name, notes, works]
, , , 4 [division, level, movements, position, season, current, gauge, notes, wheel, works]
, , , 3 [car, finish, laps, led, rank, retired, start, year, , carries]
, , , 2 [, network, owner, programming]
, , , 1 [candidates, district, first, incumbent, party, result]
, , , 1 [lifetime, name, nationality, notable, notes]
, , , 1 [, english, japanese, type, year]
, , , 1 [, comment]
, , , 1 [, length, performer, producer, title]
Table 6: Neural Programmer Neelakantan et al. (2017): Column attributions occurring in table-specific default programs, classified by operator sequence. These attributions indicate that the network is predisposed towards picking certain operators solely based on the table. Here is an insight: the second row of this table indicates that the network is inclined to choose a particular operator pair on tables about medal tallies. It may have learned this bias because some medal tables in the training data have “total” rows, which would confound answer computation if not excluded by that operator pair. However, not all medal tables have “total” rows, and hence the pair must not be applied universally. As the operators are triggered by column names and not by table entries, the network cannot distinguish between tables with and without a “total” row, and may erroneously exclude the last rows.