
Make Up Your Mind! Adversarial Generation of Inconsistent Natural Language Explanations

by Oana-Maria Camburu, et al.

To increase trust in artificial intelligence systems, a growing number of works enhance these systems with the capability of producing natural language explanations that support their predictions. In this work, we show that such appealing frameworks are nonetheless prone to generating inconsistent explanations, such as "A dog is an animal" and "A dog is not an animal", which are likely to decrease users' trust in these systems. To detect such inconsistencies, we introduce a simple but effective adversarial framework for generating a complete target sequence, a scenario that has not been addressed so far. Finally, we apply our framework to a state-of-the-art neural model that provides natural language explanations on SNLI, and we show that this model is capable of generating a significant number of inconsistencies.



1 Introduction

For machine learning systems to be widely adopted in practice, they need to be trusted by users (Dzindolet et al., 2003). However, the black-box nature of neural networks can create doubt or lack of trust, especially since recent works show that highly accurate models can heavily rely on annotation artifacts (Gururangan et al., 2018; Chen et al., 2016). In order to increase users’ trust in these systems, a growing number of works (Camburu et al., 2018; Park et al., 2018; Kim et al., 2018; Hendricks et al., 2016; Ling et al., 2017) enhance neural networks with an explanation generation module that is jointly trained to produce natural language explanations for their final decisions. The supervision on the explanations usually comes from human-provided explanations for the ground-truth answers.

In this work, we first draw attention to the fact that the explanation module may generate inconsistent explanations. For example, a system that generates “Snow implies outdoors” for justifying one prediction, and “Snow implies indoors” for justifying another prediction would likely decrease users’ trust in the system. We note that, while users may already decrease their trust in a model that generates incorrect statements, such as “Snow implies indoors”, if these statements are consistent over the input space, the users might, at least, be reassured that the explanations are a good reflection of the inner workings of the model. Subsequently, they may not trust the model when it is applied on certain concepts, such as the snow’s location, but they may trust the model on other concepts where it has shown a persistently correct understanding.

Generating Adversarial Explanations.

Adversarial examples (Szegedy et al., 2014) are inputs that have been specifically designed by an adversary to cause a machine learning algorithm to produce an incorrect answer (Biggio et al., 2013). In this work, we focus on the problem of generating adversarial explanations. More specifically, given a machine learning model that can jointly produce predictions and their explanations, we propose a framework that can identify inputs that cause the model to generate mutually inconsistent explanations.

To date, most of the research on adversarial examples in computer vision focuses on generating adversarial perturbations that are imperceptible to humans, but cause the machine learning model to produce a different prediction (Goodfellow et al., 2018). Similarly, in natural language processing, most of the literature focuses on identifying semantically invariant modifications of natural language sentences that cause neural models to change their predictions (Zhang et al., 2019a).

Our problem has three desired properties that make it different from commonly researched adversarial setups:

  1. The model has to generate a complete target sequence, i.e., the attack is considered successful if the model generates an explanation that is inconsistent with a given generated explanation. This is more challenging than the adversarial setting commonly addressed in sequence-to-sequence models, where the objective is generating sequences characterized by the presence or absence of certain given tokens (Cheng et al., 2018; Zhao et al., 2018).

  2. The adversarial input does not have to be a paraphrase or a small perturbation of the original input, since our objective is generating mutually inconsistent explanations and not a label attack. (Ideally, the explanation and predicted label align, but in general this may not be the case.)

  3. We strongly prefer the adversarial inputs to be grammatically correct English sentences — in previous works, this requirement never appears jointly with the aforementioned two requirements.

To our knowledge, our work is the first to tackle this problem setting, especially due to the complete target requirement, which is a challenging requirement for sequence generation. The simple yet effective framework that we introduce for the above scenario consists of training a neural network, which we call ReverseJustifier, to invert the explanation module, i.e., to find an input for which the model will produce a given explanation. We further create simple rules to construct a set of potentially inconsistent explanations, and query the ReverseJustifier model for inputs that could lead the original model to generate these adversarial explanations. When applied to the best explanation model from Camburu et al. (2018), our procedure detects an estimated distinct pairs of inconsistencies on the e-SNLI test set.

2 The e-SNLI Dataset

The natural language inference task consists in detecting whether a pair of sentences, called premise and hypothesis, are in a relation of: entailment, if the premise entails the hypothesis; contradiction, if the premise contradicts the hypothesis; or neutral, if neither entailment nor contradiction holds. The SNLI corpus (Bowman et al., 2015) of K such human-written instances enabled a plethora of works on this task (Rocktäschel et al., 2015; Munkhdalai and Yu, 2016; Liu et al., 2016). Recently, Camburu et al. (2018) augmented SNLI with crowd-sourced free-form explanations of the ground-truth label, called e-SNLI. Their best model for generating explanations, called ExplainThenPredictAttention (hereafter called ETPA), is a sequence-to-sequence attention model that uses two bidirectional LSTM networks (Hochreiter and Schmidhuber, 1997) for encoding the premise and hypothesis, and an LSTM decoder for generating the explanation while separately attending over the tokens of the premise and hypothesis. Furthermore, they predict the label solely based on the explanation via a separately trained neural network, which maps an explanation to a label. In our work, we show that our simple attack on the explanation generation network is able to detect a significant number of inconsistent explanations generated by ETPA. We highlight that our final goal is not the label attack, even if for this particular model, since the label is predicted solely from the explanation, we implicitly also have a label attack with high probability. (The explanation-to-label model had a test accuracy of .)

3 Method

We define two explanations to be inconsistent if they provide logically contradictory arguments. For example, “Seagulls are birds.” and “Seagulls and birds are different animals.” are inconsistent explanations (this is a real example of an inconsistency detected by our method). Our baseline method consists of the following 5 steps:

  1. Reverse the explanation module by training a ReverseJustifier model to map from a generated explanation to an input that causes the model to generate this explanation.

  2. For each explanation originally generated by ETPA, generate a list of statements that are inconsistent with this explanation; we call them adversarial explanations.

  3. Query the ReverseJustifier model on each adversarial explanation to get what we will call reverse inputs — i.e., inputs that may cause the model to produce adversarial explanations.

  4. Feed the reverse inputs into the original model to get the reverse explanations.

  5. Check if any of the reverse explanations are indeed inconsistent with the original one.
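The five steps above can be sketched as a small pipeline. This is only an illustrative sketch under stated assumptions: the function name `find_potential_inconsistencies` and the three callables are hypothetical stand-ins (lookup tables here) for the trained explanation model, the trained ReverseJustifier (Step 1 is assumed to be done already), and the rule-based generator of adversarial explanations. The toy data reuses Example (1) from Table 1.

```python
from typing import Callable, Iterable, List, Tuple

Explainer = Callable[[str, str], str]             # (premise, hypothesis) -> explanation
Reverser = Callable[[str, str], str]              # (premise, explanation) -> hypothesis
AdversarialRule = Callable[[str], Iterable[str]]  # explanation -> adversarial explanations


def find_potential_inconsistencies(
    premise: str,
    hypothesis: str,
    explain: Explainer,
    reverse_justify: Reverser,
    make_adversarial: AdversarialRule,
) -> List[Tuple[str, str, str]]:
    """Return (original explanation, reverse hypothesis, reverse explanation) triples."""
    original = explain(premise, hypothesis)
    adversarial = list(make_adversarial(original))                  # Step 2
    found = []
    for target in adversarial:
        reverse_hypothesis = reverse_justify(premise, target)       # Step 3
        reverse_explanation = explain(premise, reverse_hypothesis)  # Step 4
        if reverse_explanation in adversarial:                      # Step 5: exact match
            found.append((original, reverse_hypothesis, reverse_explanation))
    return found


# Hypothetical stand-ins for the trained models (plain lookups, not real networks).
PREMISE = "A guy in a red jacket is snowboarding in midair."
TOY_EXPLANATIONS = {
    "A guy is outside in the snow.": "Snowboarding is done outside.",
    "The guy is outside.": "Snowboarding is not done outside.",
}
TOY_REVERSE = {"Snowboarding is not done outside.": "The guy is outside."}


def toy_explain(premise: str, hypothesis: str) -> str:
    return TOY_EXPLANATIONS[hypothesis]


def toy_reverse(premise: str, explanation: str) -> str:
    return TOY_REVERSE[explanation]


def toy_rule(explanation: str) -> List[str]:
    return ["Snowboarding is not done outside."]


pairs = find_potential_inconsistencies(
    PREMISE, "A guy is outside in the snow.", toy_explain, toy_reverse, toy_rule
)
print(pairs)
```

Passing the models as callables keeps the sketch independent of any particular architecture; only the order of the calls mirrors the procedure.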

In the following, we detail how we instantiate our procedure on e-SNLI.

4 Experiments

In this work, we use the trained ETPA model from Camburu et al. (2018), which gave the highest percentage of correct explanations (). In our experiments, for the ReverseJustifier model, we use the same neural network architecture and hyperparameters used by Camburu et al. (2018) for their attention model, with the difference that inputs are now premise-explanation pairs rather than premise-hypothesis pairs, and outputs are hypotheses rather than explanations. Given a premise and an explanation, our ReverseJustifier model is able to reconstruct the correct hypothesis of the times on the e-SNLI test set. We found it satisfactory to reverse only the hypothesis; however, it is possible to jointly reverse both premise and hypothesis, which may result in detecting more inconsistencies due to the exploration of a larger portion of the input space.

To perform Step 2, we note that the explanations in e-SNLI naturally follow label-specific templates. For example, annotators often used “One cannot [X] and [Y] simultaneously” to justify a contradiction, “Just because [X], doesn’t mean [Y]” for neutral, or “[X] implies [Y]” for entailment. Since the labels are mutually exclusive, transforming an explanation from one template to a template of another label automatically creates an inconsistency. For example, for the explanation of the contradiction “One cannot eat and sleep simultaneously”, we match [X]=“eat” and [Y]=“sleep”, and we create the inconsistent explanation “Eat implies sleep” using the entailment template “[X] implies [Y]”. We note that this type of rule-based procedure is applicable beyond e-SNLI. Since explanations are by nature logical sentences, for any task, one may define a set of rules that the explanations should adhere to. For example, for explanations in self-driving cars (Kim et al., 2018), one can interchange “green light” with “red light”, or “stop” with “accelerate”, to get inconsistent (and potentially hazardous) explanations such as “The car accelerates, because it is red light”. Similarly, in law applications, one can interchange “guilty” with “innocent”, or “arrest” with “release”. Therefore, our rule-based generation strategy, and the whole framework, can be applied to any task where one is required to test its explanations against an essential set of predefined task-specific inconsistencies, and our paper encourages the community to consider such hazardous inconsistencies for their tasks.
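The template transformation described above can be illustrated with a few regular expressions. This is a minimal sketch, not the authors' implementation: it covers only the three templates quoted in the paragraph (the full lists are in Section A.1), lowercases the text, and uses a straight apostrophe; the names `MATCHERS`, `OUTPUT`, and `adversarial_explanations` are hypothetical.

```python
import re
from typing import List, Tuple

# One matcher per label, tried in order; [X] and [Y] become named groups.
MATCHERS = [
    ("contradiction", re.compile(r"^one cannot (?P<X>.+) and (?P<Y>.+) simultaneously$")),
    ("neutral", re.compile(r"^just because (?P<X>.+), doesn't mean (?P<Y>.+)$")),
    ("entailment", re.compile(r"^(?P<X>.+) implies (?P<Y>.+)$")),
]

# Output form of each label's template.
OUTPUT = {
    "contradiction": "one cannot {X} and {Y} simultaneously",
    "neutral": "just because {X}, doesn't mean {Y}",
    "entailment": "{X} implies {Y}",
}


def adversarial_explanations(explanation: str) -> List[Tuple[str, str]]:
    """Rewrite an explanation into the templates of the two other labels."""
    text = explanation.strip().rstrip(".").lower()
    for label, pattern in MATCHERS:
        match = pattern.match(text)
        if match:
            slots = match.groupdict()
            return [
                (other, template.format(**slots))
                for other, template in OUTPUT.items()
                if other != label
            ]
    return []  # no template matched


print(adversarial_explanations("One cannot eat and sleep simultaneously"))
```

For the contradiction example from the text, the entailment rewrite is "eat implies sleep". No linguistic adjustment is attempted, so some outputs are ungrammatical (for instance, rewriting "Snow implies outdoors" into the contradiction template).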

To summarize, on e-SNLI, we first created, for each label, a list of the most used templates that we manually identified by inspecting the human-annotated explanations. We provide the lists of templates in Section A.1. We then proceeded as follows: for each explanation generated by ETPA on the SNLI test set, we first reversed negations (if applicable) by simply removing the “not” and “n’t” tokens. (During pre-processing, the tokenizer splits words such as “don’t” into two tokens: “do” and “n’t”.) Secondly, we tried to match the explanation to a template. If there was no negation and no template match, we discarded the instance. We only discarded of the SNLI test set in this way. If a template was found, we identified its associated label and retrieved the matched substrings [X] and [Y]. For each of the templates associated with the two other labels, we substituted [X] and [Y] with the corresponding strings. We note that this procedure may result in grammatically or semantically incorrect adversarial explanations, especially since we did not perform any linguistic-specific adjustments. However, our ReverseJustifier turned out to perform well in smoothing out these errors and in generating grammatically correct reverse hypotheses. This is not surprising, since it has been trained to output the ground-truth correct hypothesis. Specifically, we manually annotated random instances of reversed hypotheses generated by ReverseJustifier and found to be both grammatically and semantically valid sentences.
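The negation-reversal step can be sketched on already tokenized text. This is an assumption-laden illustration: the function name `reverse_negation` is hypothetical, the input is assumed to be whitespace-separated tokens in the style described above ("do" and "n't"), and both straight and curly apostrophe forms are accepted.

```python
from typing import List, Tuple

# Tokens whose removal flips the polarity of an explanation.
NEGATION_TOKENS = {"not", "n't", "n’t"}


def reverse_negation(tokens: List[str]) -> Tuple[List[str], bool]:
    """Drop negation tokens; also report whether any token was removed."""
    kept = [tok for tok in tokens if tok.lower() not in NEGATION_TOKENS]
    return kept, len(kept) < len(tokens)


tokens, had_negation = reverse_negation("A woman is not a person .".split())
print(" ".join(tokens), had_negation)
```

The returned flag supports the discard rule: an instance with no negation and no template match yields no adversarial explanation.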

For each adversarial explanation, we queried the ReverseJustifier module and subsequently fed each obtained reverse hypothesis back to the ETPA model to get the reverse explanation. To check whether the reverse explanation was inconsistent with the original one, we again used the list of adversarial explanations generated at Step 2 and checked for an exact string match. It is likely that, at this step, we discarded a large number of inconsistencies, due to insignificant syntactic differences. However, when an exact match was found, i.e., a potential inconsistency, it was very likely to be a true inconsistency. Indeed, we manually annotated a random sample of pairs of potential inconsistencies and found to be true inconsistencies.

More precisely, our procedure first identified a total of pairs of potential inconsistencies for the ETPA model applied on the test set of e-SNLI. However, multiple distinct reverse hypotheses gave rise to the same reverse explanation. On average, we found that there are distinct reverse hypotheses giving rise to the same reverse explanation. Therefore, we counted a total of distinct pairs of potentially inconsistent explanations. Given our estimation of to be true inconsistencies, we obtained a total of distinct true inconsistencies. While this means that our procedure only has a success rate of , it is nonetheless alarming that this very simple, under-optimized framework detects a significant number of inconsistencies on a model trained on K instances.
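The counting logic in this paragraph (collapsing reverse hypotheses that lead to the same reverse explanation, then scaling by the manually estimated precision) can be sketched as follows. The helper names, the toy triples, and the precision value 0.5 are all made up for illustration and are not the paper's figures; the second reverse hypothesis in the toy list is invented.

```python
from typing import Iterable, Set, Tuple

# (original explanation, reverse hypothesis, reverse explanation)
Triple = Tuple[str, str, str]


def distinct_pairs(potential: Iterable[Triple]) -> Set[Tuple[str, str]]:
    """Collapse triples that differ only in the reverse hypothesis."""
    return {(original, reverse_expl) for original, _reverse_hyp, reverse_expl in potential}


def estimated_true(n_distinct: int, precision: float) -> float:
    """Scale the distinct-pair count by the estimated fraction of true inconsistencies."""
    return n_distinct * precision


toy = [
    ("A woman is not a person.", "A person is on a yoga mat.", "A woman is a person."),
    ("A woman is not a person.", "There is a person on a mat.", "A woman is a person."),
    ("Snowboarding is done outside.", "The guy is outside.", "Snowboarding is not done outside."),
]
pairs = distinct_pairs(toy)
print(len(pairs), estimated_true(len(pairs), precision=0.5))
```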

In Table 1, we can see three examples of true inconsistencies detected by our procedure and one example of a false inconsistency. In Example (3), we notice that the incorrect explanation was actually given on the original hypothesis.

(1) Premise: A guy in a red jacket is snowboarding in midair.
    (a) Original hypothesis: A guy is outside in the snow.
        Predicted label: entailment
        Original explanation: Snowboarding is done outside.
    (b) Reverse hypothesis: The guy is outside.
        Predicted label: contradiction
        Reverse explanation: Snowboarding is not done outside.

(2) Premise: A man talks to two guards as he holds a drink.
    (a) Original hypothesis: The prisoner is talking to two guards in the prison cafeteria.
        Predicted label: neutral
        Original explanation: The man is not necessarily a prisoner.
    (b) Reverse hypothesis: A prisoner talks to two guards.
        Predicted label: entailment
        Reverse explanation: A man is a prisoner.

(3) Premise: A woman in a black outfit lies face first on a yoga mat; several paintings are hanged on the wall, and the sun shines through a large window near her.
    (a) Original hypothesis: There is a person in a room.
        Predicted label: contradiction
        Original explanation: A woman is not a person.
    (b) Reverse hypothesis: A person is on a yoga mat.
        Predicted label: entailment
        Reverse explanation: A woman is a person.

(4) Premise: A female acrobat with long, blond curly hair, dangling upside down while suspending herself from long, red ribbons of fabric.
    (a) Original hypothesis: A horse jumps over a fence.
        Predicted label: contradiction
        Original explanation: A female is not a horse.
    (b) Reverse hypothesis: The female has a horse.
        Predicted label: neutral
        Reverse explanation: Not all female have a horse.

Table 1: Examples of three true detected inconsistencies (1)–(3) and one false detected inconsistency (4).

Manual Scanning.

Finally, we were curious to what extent a simple manual scanning would find inconsistent explanations in the e-SNLI test set alone. We performed two such experiments. First, we manually analyzed the first instances in the test set without finding any inconsistency. However, these examples involved different concepts, thus decreasing the likelihood of finding inconsistencies. To account for this, in our second experiment, we constructed three groups around the concepts of woman, prisoner, and snowboarding, by simply selecting the explanations in the test set containing these words. We selected these concepts because our framework detected inconsistencies about them; examples are listed in Table 1.

For woman, we obtained examples, and we looked at a random sample of among which we did not find any inconsistency. For snowboarding, we found examples in the test set and again no inconsistency among them. For prisoner, we only found one instance in the test set, so we had no way to find out that the model is inconsistent with respect to this concept simply by scanning the test set.
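The concept groups used in the second experiment can be built with a simple keyword filter. The sketch below is one possible reading of "selecting the explanations containing these words": it assumes whole-word, case-insensitive matching (so "women" would not match "woman"), and the function name `group_by_concept` is hypothetical. The sample explanations are taken from Table 1.

```python
import re
from typing import Dict, Iterable, List


def group_by_concept(explanations: Iterable[str], concepts: List[str]) -> Dict[str, List[str]]:
    """Collect, for each concept word, the explanations that contain it."""
    groups: Dict[str, List[str]] = {concept: [] for concept in concepts}
    for explanation in explanations:
        words = set(re.findall(r"[a-z]+", explanation.lower()))
        for concept in concepts:
            if concept in words:
                groups[concept].append(explanation)
    return groups


sample = [
    "A woman is not a person.",
    "The man is not necessarily a prisoner.",
    "Snowboarding is done outside.",
    "A female is not a horse.",
]
groups = group_by_concept(sample, ["woman", "prisoner", "snowboarding"])
print(groups)
```

Each group still has to be read by a person to spot contradictions, which is the manual effort discussed in this section.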

We only looked at the test set for a fair comparison with our method, which was only applied on this set. However, we highlight that manual scanning should not be regarded as a proper baseline, since it does not bring the same benefits as our framework. Indeed, manual scanning requires considerable human effort to look over a large set of explanations and find if any two are inconsistent. (Even a group of only explanations required non-negligible time.) Moreover, restricting ourselves to the instances in the original dataset would clearly be less effective than being able to generate new instances from the input distribution. Our framework addresses these issues and provides direct pairs of very likely (approx. ) inconsistent explanations. Nonetheless, we considered this experiment useful for illustrating that the explanation module does not provide inconsistent explanations in a frequent manner. In fact, during our scanning over explanations, we also experimented with a few manually created potential adversarial hypotheses from Carmona et al. (2018). We were pleased to notice a good level of robustness against inconsistencies. For example, for the neutral pair (premise: “A bird is above water.”, hypothesis: “A swan is above water.”), we get the explanation “Not all birds are a swan.”, while when interchanging bird with swan (premise: “A swan is above water.”, hypothesis: “A bird is above water.”), ETPA states that “A swan is a bird.” Similarly, interchanging “child” with “toddler” in (premise: “A small child watches the outside world through a window.”, hypothesis: “A small toddler watches the outside world through a window.”) does not confuse the network, which outputs “Not every child is a toddler.” and “A toddler is a small child.”, respectively. Further investigation of whether the network can be tricked on concepts where it seems to exhibit robustness, such as toddler or swan, is left for future work.

5 Related Work

Explanatory Methods.

Explaining predictions made by complex machine learning systems has been of increasing concern (Doshi-Velez and Kim, 2017). These explanations can be divided into two categories: feature importance explanations and full-sentence natural language explanations. The methods that provide feature importance explanations (Ribeiro et al., 2016; Lundberg and Lee, 2017; Chen et al., 2018; Li et al., 2016; Feng et al., 2018) aim to provide the user with the subset of input tokens that contributed the most to the prediction of the model. As pointed out by Camburu et al. (2018), these explanations are not comprehensive, as one would need to infer the missing links between the words in order to form a complete argument. For example, in the natural language inference task, if the explanation is formed by the words “dog” and “animal”, one would not know if the model learned that “A dog is an animal” or “An animal is a dog” or maybe even that “Dog and animal implies entailment”. It is also arguably more user-friendly to get a full-sentence explanation rather than a set of tokens. Therefore, an increasing number of works focus on providing full-sentence explanations (Camburu et al., 2018; Kim et al., 2018; Hendricks et al., 2016). However, generating fluent argumentation, while more appealing, is also arguably a harder and riskier task. For example, similar in spirit to our work, Hendricks et al. (2017) identified the risk of mentioning attributes from a strong class prior without any evidence being present in the input. In our work, we bring awareness to the risk of generating inconsistent explanations.

Generating Adversarial Examples.

Generating adversarial examples has received increasing attention in natural language processing (Zhang et al., 2019b; Wang et al., 2019). However, most works in this space build on the requirement that the adversarial input should be a small perturbation (Belinkov and Bisk, 2017; Hosseini et al., 2017) or preserve the main semantics (Iyyer et al., 2018) of the original input, while leading to a different prediction. While this is necessary for testing the stability of a model, our goal does not require the adversarial input to be semantically equivalent to the original, and any pair of correct English inputs that causes the model to produce inconsistent explanations suffices. On the other hand, the aforementioned models do not always require the adversarial input to be grammatically correct, and often they can change words or characters to completely random ones (Cheng et al., 2018). This assumption is acceptable for certain use cases, such as summarization of long pieces of text, where changing a few words would likely not change the main flow of the text. However, in our case, the inputs are short sentences and the model is being tested for robustness in fine-grained reasoning and common-sense knowledge; it is therefore more desirable to test the model on grammatically correct sentences.

Most importantly, to our knowledge, no previous adversarial attack for sequence-to-sequence models produces a complete target sequence. The closest to this goal, Cheng et al. (2018), requires the presence of certain tokens anywhere in the target sequence. They only test with up to 3 required tokens, and their success rate dramatically drops from for 1 required token to for 3 tokens for the task of summarization. Similarly, Zhao et al. (2018) proposed an adversarial framework for obtaining only the presence or absence of certain tokens in the target sequence for the task of machine translation. Our scenario would require as many tokens as the desired adversarial explanation, and we additionally need them to be in a given order, thus tackling a much more challenging task.
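The difference between the two attack criteria can be made concrete with a short comparison. This is only an illustration of the success conditions as described in this section, not code from any of the cited works; both function names are hypothetical.

```python
from typing import Iterable, Sequence


def token_presence_success(output_tokens: Sequence[str], required_tokens: Iterable[str]) -> bool:
    """Targeted-keyword criterion: every required token appears somewhere in the output."""
    return all(token in output_tokens for token in required_tokens)


def complete_target_success(output_tokens: Sequence[str], target_tokens: Sequence[str]) -> bool:
    """Complete-target criterion: the output equals the target, token by token and in order."""
    return list(output_tokens) == list(target_tokens)


output = "a woman is a person".split()
target = "a woman is not a person".split()

print(token_presence_success(output, ["woman", "person"]))  # keywords are present
print(complete_target_success(output, target))              # but the full sequence differs
```

An output can contain every required keyword and still fail the complete-target criterion, since a single missing or reordered token changes the sequence.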

Finally, Minervini and Riedel (2018) attempted to find inputs where a model trained on SNLI violates a set of logical constraints. This scenario may in theory lead to also finding inputs that lead to inconsistent explanations. However, their method needs to enumerate and evaluate a potentially very large set of perturbations of the inputs, obtained by, e.g., removing sub-trees or replacing tokens with their synonyms. While they succeed in finding adversarial examples, finding exact inconsistent explanations is a harder task, and hence their approach would be significantly more computationally challenging. Additionally, their perturbations are rule-based, and hence can easily generate incorrect English text. Moreover, their scenario does not require addressing the question of automatically producing undesired — in our case inconsistent — sequences.

Therefore, our work introduces a new practical attack scenario, and proposes a simple yet effective procedure, which we hope will be further improved by the community.

6 Summary and Outlook

In this work, we identified an essential shortcoming of the class of models that produce natural language explanations for their own decisions: the fact that such models are prone to producing inconsistent explanations, which can undermine users’ trust in the model. We introduced a framework for identifying pairs of inconsistent explanations. We instantiated our procedure on the best explanation model available in the literature on e-SNLI, and obtained a significant number of inconsistencies generated by this model.

The concern that we raise is general and can have a large practical impact. For example, humans would likely not accept a self-driving car if its explanation module — for example, the one proposed by Kim et al. (2018) — is prone to state that “The car accelerates, because there is a red light at the intersection”.

Future work will focus on two directions: developing more advanced procedures for detecting inconsistencies, and preventing the explanation modules from generating such inconsistencies.


This work was supported by JP Morgan PhD Fellowship 2019-2020 and by the Alan Turing Institute under the EPSRC grant EP/N510129/1, and EPSRC grant EP/R013667/1.


  • Y. Belinkov and Y. Bisk (2017) Synthetic and natural noise both break neural machine translation. CoRR abs/1711.02173.
  • B. Biggio, I. Corona, D. Maiorca, B. Nelson, N. Srndic, P. Laskov, G. Giacinto, and F. Roli (2013) Evasion attacks against machine learning at test time. In ECML/PKDD (3), Lecture Notes in Computer Science, Vol. 8190, pp. 387–402.
  • S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning (2015) A large annotated corpus for learning natural language inference. CoRR abs/1508.05326.
  • O. Camburu, T. Rocktäschel, T. Lukasiewicz, and P. Blunsom (2018) E-SNLI: Natural language inference with natural language explanations. In NeurIPS, pp. 9560–9572.
  • V. I. S. Carmona, J. Mitchell, and S. Riedel (2018) Behavior analysis of NLI models: uncovering the influence of three factors on robustness. In NAACL-HLT, pp. 1975–1985.
  • D. Chen, J. Bolton, and C. D. Manning (2016) A thorough examination of the CNN/Daily Mail reading comprehension task. CoRR abs/1606.02858.
  • J. Chen, L. Song, M. Wainwright, and M. Jordan (2018) Learning to explain: an information-theoretic perspective on model interpretation. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, Stockholm, Sweden, pp. 883–892.
  • M. Cheng, J. Yi, H. Zhang, P. Chen, and C. Hsieh (2018) Seq2Sick: Evaluating the robustness of sequence-to-sequence models with adversarial examples. CoRR abs/1803.01128.
  • F. Doshi-Velez and B. Kim (2017) Towards a rigorous science of interpretable machine learning. arXiv.
  • M. T. Dzindolet, S. A. Peterson, R. A. Pomranky, L. G. Pierce, and H. P. Beck (2003) The role of trust in automation reliance. Int. J. Hum.-Comput. Stud. 58 (6), pp. 697–718. ISSN 1071-5819.
  • S. Feng, E. Wallace, A. G. II, M. Iyyer, P. Rodriguez, and J. L. Boyd-Graber (2018) Pathologies of neural models make interpretation difficult. In EMNLP, pp. 3719–3728.
  • I. J. Goodfellow, P. D. McDaniel, and N. Papernot (2018) Making machine learning robust against adversarial inputs. Commun. ACM 61 (7), pp. 56–66.
  • S. Gururangan, S. Swayamdipta, O. Levy, R. Schwartz, S. Bowman, and N. A. Smith (2018) Annotation artifacts in natural language inference data. In Proc. of NAACL.
  • L. A. Hendricks, Z. Akata, M. Rohrbach, J. Donahue, B. Schiele, and T. Darrell (2016) Generating visual explanations. In ECCV (4), LNCS, Vol. 9908, pp. 3–19.
  • L. A. Hendricks, R. Hu, T. Darrell, and Z. Akata (2017) Grounding visual explanations (extended abstract). CoRR abs/1711.06465.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Comput. 9 (8), pp. 1735–1780. ISSN 0899-7667.
  • H. Hosseini, B. Xiao, and R. Poovendran (2017) Deceiving Google’s cloud video intelligence API built for summarizing videos. In CVPR Workshops, pp. 1305–1309.
  • M. Iyyer, J. Wieting, K. Gimpel, and L. Zettlemoyer (2018) Adversarial example generation with syntactically controlled paraphrase networks. CoRR abs/1804.06059.
  • J. Kim, A. Rohrbach, T. Darrell, J. F. Canny, and Z. Akata (2018) Textual explanations for self-driving vehicles. In ECCV (2), Lecture Notes in Computer Science, Vol. 11206, pp. 577–593.
  • J. Li, W. Monroe, and D. Jurafsky (2016) Understanding neural networks through representation erasure. CoRR abs/1612.08220.
  • W. Ling, D. Yogatama, C. Dyer, and P. Blunsom (2017) Program induction by rationale generation: learning to solve and explain algebraic word problems. CoRR abs/1705.04146.
  • Y. Liu, C. Sun, L. Lin, and X. Wang (2016) Learning natural language inference using bidirectional LSTM model and inner-attention. CoRR abs/1605.09090.
  • S. M. Lundberg and S. Lee (2017) A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems 30, pp. 4765–4774.
  • P. Minervini and S. Riedel (2018) Adversarially regularising neural NLI models to integrate logical background knowledge. In CoNLL, pp. 65–74.
  • T. Munkhdalai and H. Yu (2016) Neural semantic encoders. CoRR abs/1607.04315.
  • D. H. Park, L. A. Hendricks, Z. Akata, A. Rohrbach, B. Schiele, T. Darrell, and M. Rohrbach (2018) Multimodal explanations: justifying decisions and pointing to the evidence. CoRR abs/1802.08129.
  • M. T. Ribeiro, S. Singh, and C. Guestrin (2016) “Why should I trust you?”: explaining the predictions of any classifier. In KDD, pp. 1135–1144.
  • T. Rocktäschel, E. Grefenstette, K. M. Hermann, T. Kociský, and P. Blunsom (2015) Reasoning about entailment with neural attention. CoRR abs/1509.06664.
  • C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fergus (2014) Intriguing properties of neural networks. In ICLR (Poster).
  • W. Wang, B. Tang, R. Wang, L. Wang, and A. Ye (2019) A survey on adversarial attacks and defenses in text. CoRR abs/1902.07285.
  • W. E. Zhang, Q. Z. Sheng, and A. A. F. Alhazmi (2019a) Generating textual adversarial examples for deep learning models: A survey. CoRR abs/1901.06796.
  • W. E. Zhang, Q. Z. Sheng, A. Alhazmi, and C. Li (2019b) Adversarial attacks on deep learning models in natural language processing: a survey. CoRR abs/1901.06796.
  • Z. Zhao, D. Dua, and S. Singh (2018) Generating natural adversarial examples. In ICLR (Poster).

Appendix A Supplemental Material

A.1 Entailment Templates

List of manually created templates for generating inconsistent explanations. “token1/token2” means that a separate sentence has been generated for each of the tokens. [X] and [Y] are the key elements that we want to identify and use in the other templates in order to create inconsistencies. […] is a placeholder for any string, and its value is not relevant.

  • [X] is/are a type of [Y]

  • [X] implies [Y]

  • [X] is/are the same as [Y]

  • [X] is a rephrasing of [Y]

  • [X] is a another form of [Y]

  • [X] is synonymous with [Y]

  • [X] and [Y] are synonyms/synonymous

  • [X] can be [Y]

  • [X] and [Y] is/are the same thing

  • [X] then [Y]

  • [X] if [X] , then [Y]

  • [X] so [Y]

  • [X] must be [Y]

  • [X] has to be [Y]

  • [X] is/are [Y]

Neutral Templates

  • not all [X] are/have [Y]

  • not every [X] is/has [Y]

  • just because [X] does not/n’t mean/imply [Y]

  • [X] is/are not necessarily [Y]

  • [X] does not/n’t have to be [Y]

  • [X] does not/n’t imply/mean [Y]

Contradiction Templates

  • […] cannot/can not/can n’t [X] and [Y] at the same time/simultaneously

  • […] cannot/can not/can n’t [X] and at the same time [Y]

  • [X] is/are not (the) same as [Y]

  • […] is/are either [X] or [Y]

  • [X] is/are not [Y]

  • [X] is/are the opposite of [Y]

  • […] cannot/can not/can n’t [X] if [Y]

  • [X] is/are different than [Y]

  • [X] and [Y] are different […]
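The “token1/token2” notation used in the lists above can be expanded mechanically into one sentence pattern per alternative. The sketch below is a simplification with a hypothetical function name: it treats each whitespace-separated item as a set of single-token alternatives, so it handles entries such as “is/are” but not multi-word alternatives such as “cannot/can not/can n’t” or optional words such as “(the)”, which would need separate handling.

```python
from itertools import product
from typing import List


def expand_template(template: str) -> List[str]:
    """Expand single-token alternatives ('a/b') into one template per combination."""
    options = [token.split("/") for token in template.split()]
    return [" ".join(choice) for choice in product(*options)]


for expanded in expand_template("[X] is/are a type of [Y]"):
    print(expanded)
```

Templates with two slash groups, such as "not all [X] are/have [Y]" combined with another alternative, would yield the Cartesian product of the choices.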