Natural language inference (NLI) is typically conceived as a three-way classification problem. Given a sample such as (1), consisting of a premise (P) and a hypothesis (H), models must categorize it as either contradiction (P and H cannot both be true), entailment (if P is true, H must be true as well), or neutral (neither of the two; that is, given the truth of P, H may or may not be true, as is the case with example (1)).
(1) (P) The streets are wet. (H) It has rained.
Transformer-based pre-trained language models (PLMs) have become the de facto standard for a variety of natural language understanding (NLU) tasks, including NLI. Based on the encoder part of the original transformer architecture (Vaswani et al., 2017), researchers have proposed a number of highly successful NLU architectures, such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), XLNet (Yang et al., 2019), DeBERTa (He et al., 2020), and smaller versions such as DistilBERT and DistilRoBERTa (Sanh et al., 2019), MiniLM (Wang et al., 2020), and ALBERT (Lan et al., 2019).
Additionally, a number of sequence-to-sequence architectures have been proposed that are closer to the original transformer than to BERT in that they directly transform one sequence into another, much like the basic set-up of neural machine translation. These include T5 (Raffel et al., 2019) and BART (Lewis et al., 2020). Such PLMs now regularly outperform the human benchmark, as evinced by the GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019) leaderboards.
While it is impossible to deny the performance of these models on such benchmarks, it is another question whether this performance is driven by simple shallow heuristics or by any real understanding of the tasks they are performing. Indeed, there is a growing consensus in the literature that purely neural NLI approaches suffer from the use of shallow heuristics and, as a consequence, a lack of generalization beyond the fine-tuning dataset; see Zhou and Bansal (2020), Bras et al. (2020), Utama et al. (2020), Asael et al. (2021), He et al. (2019), Mahabadi et al. (2019), and Bernardy and Chatzikyriakidis (2019).
This study contributes to the analysis of the shallow heuristics used by state-of-the-art models by drawing on the systematic of syllogistic figures and moods, which marks the beginning of formal logic in the West but has not yet been put to use in current NLI to expose shallow heuristics. We develop a synthetic test dataset based on this systematic and use it to evaluate the NLI capacities of transformer-based PLMs that have been fine-tuned on the common NLI datasets, in particular SNLI (Bowman et al., 2015) and MNLI (Williams et al., 2018).
We begin by reviewing the state of the art (section 2); then we discuss both the MNLI and the SNLI datasets as well as our own synthetic dataset built on syllogistic (section 3). We then evaluate a number of PLMs that have been fine-tuned on the former datasets and discuss the results (sections 4 & 5).
2 Previous Research
Geirhos et al. (2020) have proposed a general diagnosis of the problem of shallow heuristics, and Ribeiro et al. (2020) have urged a more comprehensive, multi-dimensional approach to testing the abilities of these models instead of simply submitting them to automated benchmarks. Niven and Kao (2019) examine the almost-human performance of BERT on the Argument Reasoning Comprehension Task, finding that the models exploit “spurious statistical cues in the dataset” to reach this performance; they develop a dataset similar to the original one on which the PLMs perform no better than random.
With regard to the main datasets, Gururangan et al. (2018) show that SNLI and, to a lesser extent, MNLI, contain cues that make it possible to achieve very good accuracy in categorizing hypotheses by only looking at the hypotheses. Hossain et al. (2020) show that simply ignoring negation does not substantially decrease model performance in many NLI datasets. Bernardy and Chatzikyriakidis (2019) argue that both SNLI and MNLI only cover a part of the entire range of human reasoning. In particular, they suggest that they do not cover quantifiers, nor strict logical inference.
With regard to PLMs fine-tuned on these datasets, Wallace et al. (2019) show that certain triggers, inserted context-independently, lead to a stark decline in NLI accuracy. Morris et al. (2020) provide a systematic framework to create adversarial attacks for NLU. Chien and Kalita (2020) focus on syntactic biases for models fine-tuned on SNLI and MNLI, also finding that these biases are strong.
McCoy et al. (2019) hypothesize that state-of-the-art neural NLP models use three kinds of heuristics in NLI tasks: the lexical overlap heuristic (which focuses on the overall lexical overlap between premise and hypothesis), the subsequence heuristic (focusing on hypotheses that are a subsequence of their supposed premises), and the constituent heuristic (focusing on syntactically discrete units of the premise serving as hypothesis). Using their dataset HANS, which is designed so that any use of these three heuristics results in mistaken predictions, they find that state-of-the-art NLI models do make many such mistakes, suggesting that they are indeed using these heuristics.
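For illustration, the first two heuristics can be sketched as simple string predicates. This is our own sketch, not McCoy et al.'s implementation, and the example pair is merely in the spirit of HANS:

```python
# Illustrative sketch of two of the three heuristics: each predicts
# "entailment" whenever its purely surface-level condition holds.

def lexical_overlap(premise: str, hypothesis: str) -> bool:
    """All hypothesis words also occur somewhere in the premise."""
    p, h = premise.lower().split(), hypothesis.lower().split()
    return set(h) <= set(p)

def subsequence(premise: str, hypothesis: str) -> bool:
    """Hypothesis words occur in the premise in the same relative order."""
    p, h = premise.lower().split(), hypothesis.lower().split()
    it = iter(p)  # membership tests consume the iterator, enforcing order
    return all(word in it for word in h)

# A HANS-style counterexample: both conditions hold, yet the premise does
# NOT entail the hypothesis (it is the doctor who danced, not the actor).
premise = "the doctor near the actor danced"
hypothesis = "the actor danced"
print(lexical_overlap(premise, hypothesis))  # True
print(subsequence(premise, hypothesis))      # True
```

A model relying on either predicate would wrongly label this pair as entailment.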
Richardson et al. (2020), finally, use cleverly chosen semantic fragments (i.e. subsets of a language translatable into formal logic, in particular first-order predicate logic) to test the models’ understanding of the logical relationships of contradiction, entailment, and neutrality. They find that the models tested perform poorly on these tasks, but that this performance can be remedied by fine-tuning the models on sufficient amounts of training data synthetically generated from these fragments.
In non-automated logical analysis of language, the dominant approach proceeds via predicate logic, as pioneered by Frege (1892) and developed by Russell (1905). For a recent contribution in this tradition with a focus on computability, see Moot and Retoré (2019). We will rely on an older way to formalize logical relationships: syllogistic. It dates back to Aristotle (1984), and it has been somewhat sidelined by predicate logic in recent times, though it retains a staunch proponent in Oderberg (2005). For our purposes, what makes syllogistic so attractive is that, as we shall see, it allows us to synthesize large amounts of valid and invalid samples. This is because syllogistic aims at systematically finding all valid inference patterns containing three terms (a subject-, middle-, and predicate-term), which are combined via quantifiers into three statements (a conclusion and two premises).
3.1 The SNLI & MNLI Datasets
Given the importance of fine-tuning for the entire method as it is currently practiced, it is clear that this method is squarely based on the availability – and quality – of large NLI datasets.
The main datasets currently in use for neural NLI are the Stanford Natural Language Inference corpus (SNLI, Bowman et al. (2015)) and the Multi-Genre Natural Language Inference corpus (MNLI, Williams et al. (2018)). The main difference between the two corpora is that SNLI derives exclusively from image captions, while MNLI is sourced from ten different text genres (the two corpora are approximately the same size, at some 500k samples each). The following discussion will therefore focus on MNLI, which is basically a more diverse and (by design) more challenging version of SNLI.
Williams et al. (2018) presented crowdworkers with some 433k statements serving as premises and asked them to write one sentence that is entailed by the premise, one that contradicts it, and one that is neutral to it. The instructions given to the crowdworkers are reproduced in full in the appendix. To ensure maximal diversity of styles, the premises originate from ten different genres. Williams et al. (2018, 1114-5) emphasize that only minimal preprocessing occurred: duplicates within a genre were filtered, as were sentences with fewer than eight characters, and non-narrative writing such as formulae was removed manually.
In the following, we highlight two important aspects of the dataset.
A consequence of the diversity of genres and the near-absence of preprocessing in MNLI is that the corpus contains premises such as the following:
. iuh-huh how about any matching programs
It is incoherent to say that questions entail any other statements: to entail something, a statement has to have determinate truth conditions, and questions are textbook cases of sentences that lack determinate truth conditions. Questions might presuppose other statements for their appropriateness (asking “Is it going to rain today?” is inappropriate if the question is uttered in pouring rain), and they might implicate other statements (asking “Are you hungry?” typically implicates that one is ready to offer some food). As a consequence, premises such as this one cannot entail or contradict any other statements; they are alogical. Therefore, every premise-hypothesis pair that contains such a premise should be labelled neutral, as the relationship is neither entailment nor contradiction.
While this might be a rather extreme case, the basic problem is inseparable from the goal of the MNLI dataset, namely to accurately “represent the full range of American English” (Williams et al., 2018, 1114). By randomly choosing statements from genres such as conversations, one inevitably ends up with a number of statements that do not “describe a situation or event” in the typical sense, indeed, that are alogical insofar as they cannot stand in entailment or contradiction relations to other statements.
However, as the crowdworkers were instructed to write sentences resulting in entailment and contradiction pairs for each premise without exception, this leads to numerous pairs that should be labelled neutral but are in fact labelled contradiction or entailment.
We therefore hypothesize that a model fine-tuned on such a dataset would struggle to recognize neutral pairs correctly, as it has been fed with neutral pairs that have wrongly been labelled as contradiction or entailment.
It is well-documented that the dataset has a negation bias, as pointed out by Williams et al. (2018) themselves: in 48% of cases involving a negation, the correct label is contradiction. This bias very likely results from crowdworker tactics: there is no more efficient way to create a sentence that contradicts a given premise than simply to negate the premise.
A model fine-tuned on a dataset with such a negation bias would be expected to wrongly label pairs as contradictions just because either the hypothesis or the premise, but not both, contains a negation. The same effect is not to be expected if both premise and hypothesis contain negations, as such samples are unlikely to have been encountered often during training: unlike a negation in the premise or the hypothesis alone, negations in both of them cannot be the consequence of crowdworkers taking a shortcut when creating contradiction pairs.
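The hypothesized shortcut can be made explicit as a trivial baseline; the cue list and the function below are our own illustration of the bias, not a description of any actual model:

```python
# Sketch of the shortcut a negation-biased model might learn: predict
# "contradiction" iff exactly one of premise and hypothesis is negated.

def has_negation(sentence: str) -> bool:
    s = sentence.lower()
    # crude cue list, for illustration only
    return "n't" in s or any(w in s.split() for w in ("not", "no", "never"))

def negation_shortcut(premise: str, hypothesis: str) -> str:
    if has_negation(premise) != has_negation(hypothesis):
        return "contradiction"  # asymmetric negation
    return "entailment"         # fallback; the shortcut never predicts neutral

print(negation_shortcut("It has rained.", "It has not rained."))    # contradiction
print(negation_shortcut("No dogs are cats.", "No cats are dogs."))  # entailment
```

Note that such a baseline is systematically blind to samples where negation occurs on both sides, exactly as hypothesized above.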
3.2 The Syllogistic Dataset
While it has so far not been used to assess the NLI capacities of NLU models, the systematic behind our dataset dates back to Aristotle. In his Prior Analytics (composed around 350 BC), Aristotle (1984, book 1) diligently analyzes the possible combinations of subject-, predicate-, and middle-term via quantifiers and negations to form a number of formally valid inferences. He deduces 24 formally valid patterns of inference, so-called syllogisms. For instance, consider the three sentences (a), (b), and (c).
(a) All residents of California are residents of the USA.
(b) All residents of Los Angeles are residents of California.
(c) All residents of Los Angeles are residents of the USA.
These three sentences together form a formally valid inference: if you accept (a) and (b) as true, then, on pain of self-contradiction, you must also accept (c) as true. In the systematic of syllogistic, this is a mood of the first figure that goes by the name of “BARBARA”, the capital “A” signifying affirmative general assertions (“All X are Y.”). Note that the individual truth of any one of the sentences is entirely irrelevant to the formal validity of the inference.
Now consider the formal logical relationship between (d) and (e) on the one hand and (f) on the other. By changing one single word, three letters in total, we have switched the relationship from entailment to contradiction: it is not possible that (d), (e), and (f) are all true.
(d) All residents of California are residents of the USA.
(e) All residents of Los Angeles are residents of California.
(f) No residents of Los Angeles are residents of the USA.
Finally, consider the formal logical relationship between (g) and (h) on the one hand and (i) on the other. By changing one word, four letters, we have switched the relationship from entailment to neutral: if (g) and (h) are both true, (i) may or may not be true.
(g) All residents of California are residents of the USA.
(h) Some residents of Los Angeles are residents of California.
(i) All residents of Los Angeles are residents of the USA.
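These three relationships can also be checked mechanically. The following sketch, our own illustration rather than part of the dataset pipeline, enumerates all assignments of nonempty sets to the subject-, middle-, and predicate-term over a tiny universe (assuming, with Aristotle, existential import: every term applies to at least one thing) and classifies a pattern accordingly:

```python
from itertools import product

U = (0, 1, 2)  # a tiny universe; terms are modelled as subsets of it

def all_are(x, y):  return x <= y            # "All X are Y"
def no_are(x, y):   return not (x & y)       # "No X are Y"
def some_are(x, y): return bool(x & y)       # "Some X are Y"

def nonempty_subsets(universe):
    # Aristotelian existential import: every term applies to something.
    elems = list(universe)
    for bits in product((0, 1), repeat=len(elems)):
        s = frozenset(e for e, b in zip(elems, bits) if b)
        if s:
            yield s

def classify(premise1, premise2, conclusion):
    """Label a pattern by enumerating all assignments of S, M, P."""
    concl_true = concl_false = False
    for s, m, p in product(nonempty_subsets(U), repeat=3):
        if premise1(m, p) and premise2(s, m):
            if conclusion(s, p):
                concl_true = True
            else:
                concl_false = True
    if concl_true and not concl_false:
        return "entailment"      # premises force the conclusion
    if concl_false and not concl_true:
        return "contradiction"   # premises exclude the conclusion
    return "neutral"             # conclusion may or may not hold

# BARBARA: All M are P, All S are M |= All S are P
print(classify(all_are, all_are, all_are))   # entailment
# Same premises, conclusion "No S are P": cannot all be true
print(classify(all_are, all_are, no_are))    # contradiction
# "Some S are M" as second premise: conclusion is undetermined
print(classify(all_are, some_are, all_are))  # neutral
```

The three calls correspond to the three triples of sentences above; without the nonemptiness restriction, the contradiction case would not come out as contradictory under modern set-theoretic semantics.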
We use a total of 12 formally valid syllogisms – called BARBARA, CELARENT, DARII, FERIO, CESARE, CAMESTRES, FESTINO, BAROCO, DISAMIS, DATISI, BOCARDO, FERISON – and we manually develop 24 patterns that are very similar to these 12 syllogisms, but in which the first and the second sentence together contradict, or are neutral to, the third sentence. This yields a total of 36 patterns: 12 valid syllogisms, 12 contradictory patterns, and 12 neutral patterns. To fit the premise-hypothesis structure expected by the models, we combine premises one and two into a single premise.
We then use pre-compiled lists of occupations, hobbies, and nationalities to fill the subject-, middle-, and predicate-terms in these patterns. Using 15 of each and combining them with the 36 patterns yields 121,500 test cases in total (36 × 15³), each consisting of a premise and a hypothesis. For a fully specified sample instantiating BARBARA, a mood of the first figure, see the following example.
. (P) All Gabonese are Budget analysts, and all Element collectors are Gabonese. (H) All Element collectors are Budget analysts.
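The generation procedure can be sketched as follows. The miniature term lists and the single pattern template below are illustrative stand-ins for our actual lists of 15 terms per category and the 36 patterns; with the full lists, 36 × 15³ = 121,500 samples result:

```python
from itertools import product

# Hypothetical miniature term lists (the actual dataset uses 15 of each).
OCCUPATIONS   = ["Budget analysts", "Florists"]
HOBBIES       = ["Element collectors", "Kite flyers"]
NATIONALITIES = ["Gabonese", "Tuvaluans"]

# One of the 36 patterns: BARBARA, with the two premises merged into a
# single premise. {S} = subject-, {M} = middle-, {P} = predicate-term.
BARBARA = ("All {M} are {P}, and all {S} are {M}.",
           "All {S} are {P}.",
           "entailment")

def instantiate(patterns, subjects, middles, predicates):
    """Fill every pattern with every combination of terms."""
    samples = []
    for (prem, hyp, label), (s, m, p) in product(
            patterns, product(subjects, middles, predicates)):
        samples.append((prem.format(S=s, M=m, P=p),
                        hyp.format(S=s, P=p),
                        label))
    return samples

samples = instantiate([BARBARA], HOBBIES, NATIONALITIES, OCCUPATIONS)
print(len(samples))      # 1 pattern x 2 x 2 x 2 term choices = 8
print(samples[0][0])     # reproduces the fully specified example above
```

The first generated sample coincides with the fully specified BARBARA instance given in the text.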
We run a total of seven models on our test dataset, all of which are fine-tuned on the standard NLI datasets SNLI and MNLI (see table 1 for details: PLMs marked with one star “*” have only been fine-tuned on MNLI, PLMs marked with two stars have been fine-tuned on both SNLI and MNLI). The models are provided via https://huggingface.co (Wolf et al., 2019), three of them by textattack (Morris et al., 2020) and four by Cross Encoder (Reimers and Gurevych, 2019).
The models’ performance on MNLI, per our own evaluation (not all of the models provide evaluation scores, and we did not find precise documentation of how the scores were obtained), is given in table 1; for details of the evaluation, see the appendix, section B.
The basic idea behind the experiment is to assess whether the PLMs’ performance on our dataset reveals any shallow heuristics learned by the models during fine-tuning on MNLI and SNLI.
The results of our experiments are shown in figure 1. For instance, the model whose performance is represented on the very left, textattack’s fine-tuned version of BART large, predicts the correct label in only 7% of cases for neutral samples, while doing so in 95% of cases for entailment samples and still 83% for contradiction samples.
Figure 1 shows clearly that the models’ predictions are quite accurate for the labels entailment and contradiction, but very poor for neutral. In table 2, we therefore focus on the top three models’ performance on neutral samples. The table shows that the two smaller models have a preference for entailment, while the largest model tested has a slight preference for contradiction on the neutral samples. None of the models achieves an accuracy of more than 10%, far below pure guessing (33.3%).
Overall, textattack’s distilbert leads the field with an accuracy of 65%, which might be surprising given that it is among the smallest models evaluated here. However, there is growing evidence that NLI cannot be solved by simply increasing model size. Researchers at DeepMind find that larger models tend to generalize worse, not better, when it comes to tasks involving logical relationships: the large study by Rae et al. (2021, 23) strongly suggests that, in the words of the authors, “the benefits of scale are nonuniform”, and that logical and mathematical reasoning does not improve when scaling up to the gigantic size of Gopher, a model with 280B parameters (in contrast, Gopher sets a new SOTA on many other NLU tasks such as RACE-h and RACE-m, where it outperforms GPT-3 by some 25% in accuracy).
On the face of it, 65% looks like an excellent score for a model that has not been trained on the rather challenging dataset that we are using. However, when looking closer at the data, we find that the models perform very poorly on neutral samples; indeed, none of the models is able to recognize such neutral relationships with an accuracy of more than 10%. Given that pure chance would still yield an accuracy of some 33%, this is a very poor performance.
We have therefore further probed the heuristics that the models might be using that could cause the poor performance with neutral labels. Manual inspection showed that they respond strongly to symmetries regarding quantifiers and negations between premises and hypotheses. In particular, if either both or none of the premise and the hypothesis contain a “some” (existential quantifier) or a negation (the symmetric conditions), then the models are strongly biased to predict entailment (see figure 2).
Conversely, if the pattern contains an asymmetry regarding existential quantifier and negation between premise and hypothesis, then the models are very strongly inclined to predict contradiction (see figure 3).
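The conjectured symmetry heuristic can be stated as a trivial classifier. This is our own sketch of the hypothesized shortcut, not the models' actual mechanism:

```python
# Sketch of the conjectured symmetry heuristic: a symmetric distribution of
# "some"/negation cues between premise and hypothesis biases the prediction
# towards entailment, an asymmetric distribution towards contradiction.

def cues(sentence: str) -> set:
    words = sentence.lower().split()
    return {c for c in ("some", "no") if c in words}

def symmetry_heuristic(premise: str, hypothesis: str) -> str:
    return "entailment" if cues(premise) == cues(hypothesis) else "contradiction"

# Succeeds on an entailment pair (cue sets match on both sides) ...
print(symmetry_heuristic(
    "All Gabonese are Budget analysts, and all Element collectors are Gabonese.",
    "All Element collectors are Budget analysts."))  # entailment
# ... but can never output "neutral", e.g. for this neutral pair:
print(symmetry_heuristic(
    "All Gabonese are Budget analysts, and some Element collectors are Gabonese.",
    "All Element collectors are Budget analysts."))  # contradiction (wrong)
```

Such a classifier does well on our entailment and contradiction patterns while necessarily failing on all neutral ones, matching the behavior observed in figures 2 and 3.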
We have been surprised to see that the models are sensitive not only to symmetry regarding the negation particle, but also regarding the quantifier.
In the case of the contradiction and entailment pairs, these heuristics serve the models very well on our dataset, resulting in impressive performance. When applied to the neutral samples, however, the heuristics break down, and performance falls far below simple guessing.
The observations made regarding the MNLI dataset (see above, section 3.1), together with the overlap biases observed by McCoy et al. (2019), can to some extent explain the models’ behavior with neutral samples. Our observations regarding the features of the MNLI dataset, namely the alogical premises and the negation bias, suggest that a model fine-tuned on this dataset would struggle to identify neutral samples as such: if alogical premises also occur in contradiction and entailment pairs, and if negated sentences generally indicate contradiction, then the models would be expected to struggle to identify neutral samples that contain negations, samples that are somewhat difficult to identify as such even for humans.
Furthermore, as the hypotheses overlap almost entirely with the premises in our neutral samples, the biases observed by McCoy et al. (2019) would be expected to contribute to this failure to identify the neutral samples as such.
One could wonder whether these results are of any significance for the models investigated. One could argue that it is no surprise that the models perform poorly on samples that differ relevantly from the ones that they have seen during training. Hence, it is to be expected – and no reason for concern – that the models fail to perform well at these tasks.
In response to this, we would like to raise attention to the fact that the task is called natural language inference, and the concepts that the models are intended to learn are entailment and contradiction. If it turns out, as we suggest it has in our study and in similar ones referenced above, section 2, that the models’ performance collapses even though the same logical relationships are in focus, while some superficial cues have disappeared, then it is wrong to say that the models are addressing the task of NLI, or that they are learning the logical concepts. Rather, the models are picking up spurious statistical cues that are correlated to the logical concepts in some dataset such as MNLI, but that are entirely unrelated to them in our dataset.
In other words, we suggest that the current lack of generalization beyond the training dataset that we can observe in our study (but which is also more widely acknowledged, see the references in section 1 and 2) is indeed a reason for concern. It implies that the models do not actually learn NLI but rather the exploitation of spurious statistical cues in the dataset, leading to shallow heuristics.
So far, we have investigated the ability of state-of-the-art transformer-based PLMs fine-tuned on the common NLI datasets to recognize formal logical relationships in syllogistic patterns. Our results show that the models are very good at distinguishing entailment from contradiction, but very bad, much worse than chance, at distinguishing either from neutral. Our analysis has suggested that this is due to the PLMs’ use of shallow heuristics, in particular with the attention to symmetries regarding negation and quantifiers between premises and hypotheses. We suggest that our study adds to the evidence that current transformer-based models do not actually learn NLI.
- Aristotle (1984) Aristotle. 1984. Prior analytics. In Jonathan Barnes, editor, The Complete Works of Aristotle, pages 39–113. Oxford University Press.
- Asael et al. (2021) Dimion Asael, Zachary Ziegler, and Yonatan Belinkov. 2021. A generative approach for mitigating structural biases in natural language inference. arXiv preprint arXiv:2108.14006.
- Bernardy and Chatzikyriakidis (2019) Jean-Philippe Bernardy and Stergios Chatzikyriakidis. 2019. What kind of natural language inference are nlp systems learning: Is this enough? In ICAART (2), pages 919–931.
- Bowman et al. (2015) Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. 2015. A large annotated corpus for learning natural language inference. In Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, pages 632–642. Association for Computational Linguistics (ACL).
- Bras et al. (2020) Ronan Le Bras, Swabha Swayamdipta, Chandra Bhagavatula, Rowan Zellers, Matthew E Peters, Ashish Sabharwal, and Yejin Choi. 2020. Adversarial filters of dataset biases. arXiv preprint arXiv:2002.04108.
- Chien and Kalita (2020) Tiffany Chien and Jugal Kumar Kalita. 2020. Adversarial analysis of natural language inference systems. 2020 IEEE 14th International Conference on Semantic Computing (ICSC), pages 1–8.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT.
- Frege (1892) Gottlob Frege. 1892. Über sinn und bedeutung. Zeitschrift für Philosophie und philosophische Kritik, 100:25–50.
- Geirhos et al. (2020) Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. 2020. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665–673.
- Gururangan et al. (2018) Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A Smith. 2018. Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 107–112.
- He et al. (2019) He He, Sheng Zha, and Haohan Wang. 2019. Unlearn dataset bias in natural language inference by fitting the residual. arXiv preprint arXiv:1908.10763.
- He et al. (2020) Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2020. Deberta: Decoding-enhanced bert with disentangled attention. In International Conference on Learning Representations.
- Hossain et al. (2020) Md Mosharaf Hossain, Venelin Kovatchev, Pranoy Dutta, Tiffany Kao, Elizabeth Wei, and Eduardo Blanco. 2020. An analysis of natural language inference benchmarks through the lens of negation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9106–9118.
- Lan et al. (2019) Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. Albert: A lite bert for self-supervised learning of language representations. In International Conference on Learning Representations.
- Lewis et al. (2020) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. ArXiv, abs/1910.13461.
- Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
- Mahabadi et al. (2019) Rabeeh Karimi Mahabadi, Yonatan Belinkov, and James Henderson. 2019. End-to-end bias mitigation by modelling biases in corpora. arXiv preprint arXiv:1909.06321.
- McCoy et al. (2019) Tom McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3428–3448.
- Moot and Retoré (2019) Richard Moot and Christian Retoré. 2019. Natural language semantics and computability. Journal of Logic, Language and Information, 28(2):287–307.
- Morris et al. (2020) John Morris, Eli Lifland, Jin Yong Yoo, Jake Grigsby, Di Jin, and Yanjun Qi. 2020. Textattack: A framework for adversarial attacks, data augmentation, and adversarial training in nlp. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 119–126.
- Niven and Kao (2019) Timothy Niven and Hung-Yu Kao. 2019. Probing neural network comprehension of natural language arguments. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4658–4664.
- Oderberg (2005) David S. Oderberg. 2005. Predicate logic and bare particulars. In David S. Oderberg, editor, The Old New Logic, pages 183–210. The MIT Press.
- Rae et al. (2021) Jack W. Rae, Sebastian Borgeaud, and Trevor Cai et al. 2021. Scaling language models: Methods, analysis & insights from training gopher. DeepMind Company Publication.
- Raffel et al. (2019) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
- Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084.
- Ribeiro et al. (2020) Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. Beyond accuracy: Behavioral testing of nlp models with checklist. arXiv preprint arXiv:2005.04118.
- Richardson et al. (2020) Kyle Richardson, Hai Hu, Lawrence Moss, and Ashish Sabharwal. 2020. Probing natural language inference models through semantic fragments. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8713–8721.
- Russell (1905) Bertrand Russell. 1905. On denoting. Mind, 14(56):479–493.
- Sanh et al. (2019) Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
- Utama et al. (2020) Prasetya Ajie Utama, Nafise Sadat Moosavi, and Iryna Gurevych. 2020. Towards debiasing nlu models from unknown biases. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7597–7610.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS.
- Wallace et al. (2019) Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019. Universal adversarial triggers for attacking and analyzing NLP. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2153–2162.
- Wang et al. (2019) Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. Superglue: A stickier benchmark for general-purpose language understanding systems. In NeurIPS.
- Wang et al. (2018) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics.
- Wang et al. (2020) Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. 2020. Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. arXiv preprint arXiv:2002.10957.
- Williams et al. (2018) Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122. Association for Computational Linguistics.
- Wolf et al. (2019) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2019. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.
- Yang et al. (2019) Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. In Advances in neural information processing systems, pages 5753–5763.
- Zhou and Bansal (2020) Xiang Zhou and Mohit Bansal. 2020. Towards robustifying nli models against lexical dataset biases. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8759–8771.
Appendix A Full Instructions Given to Crowdworkers
Williams et al. (2018, 1114) specify the following tasks for the crowdworkers:
“This task will involve reading a line from a non-fiction article and writing three sentences that relate to it. The line will describe a situation or event. Using only this description and what you know about the world:
Write one sentence that is definitely correct about the situation or event in the line.
Write one sentence that might be correct about the situation or event in the line.
Write one sentence that is definitely incorrect about the situation or event in the line. ”
Appendix B Method used for evaluation of Models on MNLI
To evaluate the models, we have used Huggingface’s trainer API, see Huggingface (Wolf et al., 2019). In particular, we followed the instructions in the notebook here. We evaluated the models using the API out-of-the-box, with the following exceptions:
The textattack models used the labels “LABEL_0”, “LABEL_1”, and “LABEL_2”, which could not be read by the function that ensures that the labels are used equivalently by model and dataset; hence, we reconfigured the models to use the labels “contradiction”, “entailment”, and “neutral”.
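A minimal sketch of this relabelling is given below; the index order shown (0 → contradiction, 1 → entailment, 2 → neutral) is an assumption for illustration, and the correct order must be read off each checkpoint's training configuration. In the transformers library, the remapping amounts to setting `model.config.id2label` and `model.config.label2id`:

```python
# Hypothetical mapping from the generic class names shipped with the
# checkpoints onto MNLI's label strings. The index order below is an
# ASSUMPTION for illustration, not the documented order of any checkpoint.
id2label = {0: "contradiction", 1: "entailment", 2: "neutral"}
label2id = {v: k for k, v in id2label.items()}

def rename(generic_label: str) -> str:
    """Map e.g. 'LABEL_0' to the corresponding MNLI label string."""
    idx = int(generic_label.rsplit("_", 1)[1])
    return id2label[idx]

print(rename("LABEL_2"))  # neutral (under the assumed order)
```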
facebook-bart-large-mnli by textattack posed two additional challenges.
Due to out-of-memory issues, we had to split the processing of the validation set into three chunks and average the resulting accuracies afterwards.
The logits issued by facebook-bart-large-mnli could not be processed by the evaluation function as-is; we therefore selected only the first slice of the tensor issued by the model, ensuring that the metric function received a 1-dimensional tensor from which to compute accuracy.
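The workaround can be sketched as follows with a stand-in model output; `preprocess_logits` and the fake tensors are illustrative, not the exact code we ran:

```python
import numpy as np

def preprocess_logits(model_output):
    """Keep only the classification logits (the first element when the model
    returns a tuple), then argmax to 1-dimensional label predictions."""
    logits = model_output[0] if isinstance(model_output, tuple) else model_output
    return np.argmax(logits, axis=-1)

# Stand-in for a BART-style output: (logits, some extra tensor).
fake_output = (np.array([[0.1, 2.0, -1.0],
                         [1.5, 0.2,  0.3]]),  # logits: 2 samples x 3 labels
               np.zeros((2, 8)))              # extra tensor the metric cannot use
preds = preprocess_logits(fake_output)
print(preds.tolist())  # [1, 0]
```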