
Uncovering More Shallow Heuristics: Probing the Natural Language Inference Capacities of Transformer-Based Pre-Trained Language Models Using Syllogistic Patterns

In this article, we explore the shallow heuristics used by transformer-based pre-trained language models (PLMs) that are fine-tuned for natural language inference (NLI). To do so, we construct our own dataset based on syllogistic, and we evaluate a number of models' performance on our dataset. We find evidence that the models rely heavily on certain shallow heuristics, picking up on symmetries and asymmetries between premise and hypothesis. We suggest that the lack of generalization observable in our study, which is becoming a topic of lively debate in the field, means that the PLMs are currently not learning NLI, but rather spurious heuristics.



1 Introduction

Current natural language inference (NLI) is typically conceived as a three-way classification problem, taking samples such as (1), consisting of a premise (P) and a hypothesis (H), and requiring the models to categorize them as either contradiction (P and H cannot both be true), entailment (if P is true, H must be true as well), or neutral (neither of the two, that is, given the truth of P, H may or may not be true; this is the case with example (1)).

(1) (P) The streets are wet. (H) It has rained.

Transformer-based pre-trained language models (PLMs) have become the de facto standard in a variety of natural language understanding (NLU) tasks, including NLI. Based on the encoding part of the original transformer architecture (Vaswani et al., 2017), researchers have proposed a number of highly successful NLU architectures, such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), XLNet (Yang et al., 2019), DeBERTa (He et al., 2020), and smaller versions such as DistilBERT and DistilRoBERTa (Sanh et al., 2019), MiniLM (Wang et al., 2020) and ALBERT (Lan et al., 2019). Additionally, a number of sequence-to-sequence architectures have been proposed that are more similar to the original transformer than to BERT in that they directly try to transform one sequence to another, much like the basic set-up of neural machine translation. These include T5 (Raffel et al., 2019) and BART (Lewis et al., 2020). These PLMs now regularly outperform the human baseline, as evinced by the GLUE and SuperGLUE leaderboards (Wang et al., 2018; Wang et al., 2019).

While it is impossible to deny the performance of these models at such benchmarks, it is another question whether this performance is driven by simple shallow heuristics or by any real understanding of the tasks that they are performing. Indeed, there is a growing consensus in the literature that purely neural NLI approaches suffer from the use of shallow heuristics and, as a consequence, a lack of generalization beyond the fine-tuning dataset; see Zhou and Bansal (2020), Bras et al. (2020), Utama et al. (2020), Asael et al. (2021), He et al. (2019), Mahabadi et al. (2019), and Bernardy and Chatzikyriakidis (2019).

This study contributes to this analysis of the shallow heuristics used by state-of-the-art models by drawing on the system of syllogistic figures and moods, which marks the beginning of formal logic in the West, but which has not yet been put to use in current NLI to expose shallow heuristics. We develop a synthetic test dataset based on this system and use it to evaluate the NLI capacities of transformer-based PLMs that have been fine-tuned on the common NLI datasets, in particular SNLI (Bowman et al., 2015) and MNLI (Williams et al., 2018).

We begin by reviewing the state of the art (section 2), then we discuss both the MNLI and the SNLI datasets as well as our own synthetic dataset built on syllogistic (section 3). We then evaluate a number of PLMs that have been fine-tuned on the former datasets and discuss the results (sections 4 to 6).

2 Previous Research

Geirhos et al. (2020) have proposed a general diagnosis of the problem of shallow heuristics, and Ribeiro et al. (2020) have urged a more comprehensive, multi-dimensional approach to testing the abilities of these models instead of simply submitting them to automated benchmarks. Niven and Kao (2019) consider the almost-human performance of BERT at the Argument Reasoning comprehension task, finding that the models exploit “spurious statistical cues in the dataset” to reach this performance; they develop a dataset similar to the original one at which the PLMs do not perform better than random.

With regard to the main datasets, Gururangan et al. (2018) show that SNLI and, to a lesser extent, MNLI, contain cues that make it possible to achieve very good accuracy in categorizing hypotheses by only looking at the hypotheses. Hossain et al. (2020) show that simply ignoring negation does not substantially decrease model performance in many NLI datasets. Bernardy and Chatzikyriakidis (2019) argue that both SNLI and MNLI only cover a part of the entire range of human reasoning. In particular, they suggest that they do not cover quantifiers, nor strict logical inference.

With regard to PLMs fine-tuned on these datasets, Wallace et al. (2019) show that certain triggers, inserted context-independently, lead to a stark decline in NLI accuracy. Morris et al. (2020) provide a systematic framework to create adversarial attacks for NLU. Chien and Kalita (2020) focus on syntactic biases for models fine-tuned on SNLI and MNLI, also finding that these biases are strong.

McCoy et al. (2019) hypothesize that state-of-the-art NLP models use three kinds of heuristics at NLI tasks: the lexical overlap heuristic (which focuses on the overall lexical overlap between premise and hypothesis), the subsequence heuristic (focusing on hypotheses that are a subsequence of supposed premises), and the constituent heuristic (focusing on syntactically discrete units of the premise as hypothesis). Using their dataset HANS, which is designed so that any use of these three heuristics results in mistaken predictions, they find that state-of-the-art models in NLI do make many such mistakes, suggesting that they are indeed using these heuristics.

Richardson et al. (2020), finally, use cleverly chosen semantic fragments (i.e. subsets of a language translatable into formal logic, in particular first-order predicate logic) to test the models’ understanding of the logical relationships of contradiction, entailment and neutral. They find that the models tested perform poorly on these tasks, but that this performance can be remedied with fine-tuning the models on sufficient amounts of training data that has been synthetically generated from these fragments.

In non-automated logical analysis of language, the dominant approach proceeds by using predicate logic, as pioneered by Frege (1892) and developed by Russell (1905). For a recent contribution in this tradition with a focus on computability, see Moot and Retoré (2019). We will rely on an older way to formalize logical relationships: syllogistic. It dates back to Aristotle (1984), and it has been somewhat sidelined by predicate logic recently. However, it has one staunch proponent in Oderberg (2005). For our purposes, what makes syllogistic so attractive is that, as we shall see, it allows us to synthesize large amounts of valid and invalid samples. This is because syllogistic aims at systematically finding all valid inference patterns containing three predicates (a subject-, middle- and predicate-term), which are then combined via quantifiers into three statements (a conclusion and two premises).

3 Datasets

3.1 The SNLI & MNLI Datasets

Given the importance of fine-tuning for the entire method as it is currently practiced, it is clear that this method is squarely based on the availability – and quality – of large NLI datasets.

The main datasets in use currently for neural NLI are the Stanford Natural Language Inference corpus (SNLI, Bowman et al. (2015)) and the Multi Genre Natural Language Inference corpus (MNLI, Williams et al. (2018)). The main difference between the two corpora is that SNLI derives exclusively from image captions, while MNLI is sourced from 10 different text genres (the two corpora are approximately the same size of some 500k samples). The following discussion will therefore focus on MNLI, being basically a more diverse and (by design) more challenging version of SNLI.

Williams et al. (2018) presented crowdworkers with some 433k statements serving as premises, and then asked them to write one sentence that is entailed by this premise, one that contradicts it, and one that is neutral to it. The instructions given to the crowdworkers are reproduced in full in the appendix. To ensure maximal diversity of styles, the premises originate from ten different genres. Williams et al. (2018, 1114-5) emphasize that only minimal preprocessing has occurred: filtering duplicates within a genre, removing sentences with fewer than eight characters, and manually removing non-narrative writing such as formulae.

In the following, we highlight two important aspects of the dataset.

Alogical Premises

A consequence of the diversity of genres and the near-absence of preprocessing in MNLI is that the corpus contains premises such as (2).

(2) uh-huh how about any matching programs

It is incoherent to say that questions entail any other statements: to entail something, a statement has to have determinate truth conditions; questions are textbook cases of sentences that have no determinate truth conditions. Questions might presuppose other statements for their appropriateness (asking “Is it going to rain today?” is inappropriate if the question is uttered in pouring rain), and they might implicate other statements (asking “Are you hungry?” typically implicates that one is ready to offer some food). As a consequence, premises such as (2) cannot entail or contradict any other statements; they are alogical. Therefore, every premise-hypothesis pair that contains (2) as a premise should be labelled neutral, as the relationship is neither entailment nor contradiction.

While this might be a rather extreme case, the basic problem is inseparable from the goal of the MNLI dataset, namely to accurately “represent the full range of American English” (Williams et al., 2018, 1114). By randomly choosing statements from genres such as conversations, one inevitably ends up with a number of statements that do not “describe a situation or event” in the typical sense, indeed, that are alogical insofar as they cannot stand in entailment or contradiction relations to other statements.

However, as the crowdworkers were instructed to write sentences resulting in entailment and contradiction pairs for each premise without exception, this leads to numerous pairs that should be labelled neutral but are in fact labelled contradiction or entailment.

We therefore hypothesize that a model fine-tuned on such a dataset would struggle to recognize neutral pairs correctly, as it has been fed with neutral pairs that have wrongly been labelled as contradiction or entailment.

Negation Bias

It is well-documented that the dataset has a negation bias. This is pointed out by Williams et al. (2018) themselves: in 48% of cases involving a negation, the correct label is contradiction. It is very likely that this bias results from crowdworker tactics: there is no more efficient way to create a sentence that contradicts any given premise than by simply negating the premise.

A model fine-tuned on a dataset with such a negation bias would be expected to wrongly label pairs as contradictions just because either the hypothesis or the premise, but not both, contains a negation. The same effect is not to be expected if both premise and hypothesis contain negations, as such samples would not be expected to have been encountered too often during training: unlike a negation in premise or hypothesis, negations in both of them cannot be the consequence of crowdworkers’ taking a shortcut when creating contradiction pairs.

3.2 The Syllogistic Dataset

While it has so far not been used to assess the NLI capacities of NLU models, the system behind our dataset dates back to Aristotle. In his Prior Analytics (composed around 350 BC), Aristotle (1984, book 1) diligently analyzes the possible combinations of subject-, predicate-, and middle-term via quantifiers and negations to form a number of formally valid inferences. He deduces 24 formally valid patterns of inference, so-called syllogisms. For instance, consider the three sentences (3), (4), and (5).

(3) All residents of California are residents of the USA.

(4) All residents of Los Angeles are residents of California.

(5) All residents of Los Angeles are residents of the USA.

These three sentences together form a formally valid inference: if you accept (3) and (4) as true, then, on pain of self-contradiction, you must also accept (5) as true. In the system of syllogistic, it is a mood of the first figure that goes by the name of “BARBARA”, the capital “A” signifying affirmative general assertions (“All X are Y.”). Note that the individual truth of any one of the sentences is entirely irrelevant for the formal validity of the inference.

Now, consider the formal logical relationship between (6) and (7) on the one hand and (8) on the other hand. By changing one single word, three letters in total, we have switched the relationship from entailment to contradiction: it is not possible that (6), (7), and (8) are all true.

(6) All residents of California are residents of the USA.

(7) All residents of Los Angeles are residents of California.

(8) No residents of Los Angeles are residents of the USA.

Finally, consider the formal logical relationship between (9) and (10) on the one hand and (11) on the other hand. By changing one word, four letters, we have switched the relationship from entailment to neutral: if (9) and (10) are both true, (11) may or may not be true.

(9) All residents of California are residents of the USA.

(10) Some residents of Los Angeles are residents of California.

(11) All residents of Los Angeles are residents of the USA.

We are using a total of 12 formally valid syllogisms – called BARBARA, CELARENT, DARII, FERIO, CESARE, CAMESTRES, FESTINO, BAROCO, DISAMIS, DATISI, BOCARDO, FERISON – and we manually develop 24 patterns that are very similar to these 12 syllogisms, but where the first and the second sentence together contradict or are neutral to the third sentence. This yields a total of 36 patterns, 12 of which are valid syllogisms, 12 are contradictory, and 12 are neutral. To fit the premise-hypothesis structure expected by the models, we combine premise one and two to form a single premise.
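The three relationships can be checked mechanically. The following is a minimal sketch, not the procedure used to build the dataset: it enumerates all assignments of non-empty sets to the three terms over a small universe and asks whether the hypothesis is true in all, none, or only some of the situations in which both premises hold. The function names, the tuple encoding of statements, and the three-element universe are assumptions made for illustration, and the requirement that every term be non-empty reflects the traditional reading under which the "No" variant counts as a contradiction.

```python
from itertools import combinations, product

UNIVERSE = (0, 1, 2)  # assumed size; large enough for the examples below


def nonempty_subsets(universe):
    """All non-empty subsets of the universe, as frozensets."""
    return [frozenset(c)
            for r in range(1, len(universe) + 1)
            for c in combinations(universe, r)]


def holds(statement, terms):
    """Evaluate a categorical statement (quantifier, subject, predicate)."""
    quantifier, subj, pred = statement
    s, p = terms[subj], terms[pred]
    if quantifier == "all":
        return s <= p
    if quantifier == "no":
        return not (s & p)
    if quantifier == "some":
        return bool(s & p)
    if quantifier == "some_not":
        return bool(s - p)
    raise ValueError(f"unknown quantifier: {quantifier}")


def classify(premises, hypothesis, names=("S", "M", "P")):
    """Label a pattern by checking every situation in which the premises hold."""
    outcomes = set()
    for sets in product(nonempty_subsets(UNIVERSE), repeat=len(names)):
        terms = dict(zip(names, sets))
        if all(holds(premise, terms) for premise in premises):
            outcomes.add(holds(hypothesis, terms))
    if outcomes == {True}:
        return "entailment"
    if outcomes == {False}:
        return "contradiction"
    return "neutral"  # hypothesis true in some situations, false in others


# S = residents of Los Angeles, M = residents of California, P = residents of the USA
barbara = classify([("all", "M", "P"), ("all", "S", "M")], ("all", "S", "P"))
contra = classify([("all", "M", "P"), ("all", "S", "M")], ("no", "S", "P"))
neutral = classify([("all", "M", "P"), ("some", "S", "M")], ("all", "S", "P"))
print(barbara, contra, neutral)
```

Under these assumptions, the three California examples come out as entailment, contradiction, and neutral respectively, matching the labels given in the text.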

We then use a pre-compiled list of occupations, hobbies, and nationalities to fill the subject-, middle- and predicate-terms in these patterns. Using 15 of each of them and combining them with the 36 patterns yields 15 × 15 × 15 × 36 = 121,500 test cases in total, each consisting of a premise and a hypothesis. For a fully specified sample that instantiates BARBARA, a mood of the first figure, see example (12).

(12) (P) All Gabonese are Budget analysts, and all Element collectors are Gabonese. (H) All Element collectors are Budget analysts.
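The construction can be sketched as follows. This is an illustrative reconstruction, not the generation script itself: only the three fillers from the BARBARA sample above are taken from the text, all other list entries are hypothetical placeholders, and the assignment of nationalities to the middle term, occupations to the predicate term, and hobbies to the subject term is inferred from that single sample.

```python
from itertools import product

# Only the first entry of each list comes from the sample in the text;
# the remaining 14 entries are hypothetical placeholders.
NATIONALITIES = ["Gabonese"] + [f"Nationality{i}" for i in range(2, 16)]
OCCUPATIONS = ["Budget analysts"] + [f"Occupation{i}" for i in range(2, 16)]
HOBBIES = ["Element collectors"] + [f"Hobby{i}" for i in range(2, 16)]

# One of the 36 patterns (BARBARA), written as a template; the two
# syllogistic premises are joined into a single premise string.
BARBARA = {
    "premise": "All {m} are {p}, and all {s} are {m}.",
    "hypothesis": "All {s} are {p}.",
    "label": "entailment",
}

N_PATTERNS = 36  # 12 entailment, 12 contradiction, 12 neutral


def instantiate(pattern, s, m, p):
    """Fill subject (s), middle (m) and predicate (p) terms into a pattern."""
    return {
        "premise": pattern["premise"].format(s=s, m=m, p=p),
        "hypothesis": pattern["hypothesis"].format(s=s, m=m, p=p),
        "label": pattern["label"],
    }


# All 15 x 15 x 15 fillings of a single pattern.
samples = [instantiate(BARBARA, s, m, p)
           for m, p, s in product(NATIONALITIES, OCCUPATIONS, HOBBIES)]

print(samples[0]["premise"])
print(samples[0]["hypothesis"])
print(len(samples), len(samples) * N_PATTERNS)
```

Each pattern yields 3,375 samples, so the 36 patterns together give the 121,500 test cases stated above, with the three labels equally represented.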

4 Experiment

We run a total of seven models on our test dataset, all of which are fine-tuned on standard NLI datasets, namely SNLI and MNLI (see table 1 for details: PLMs marked with one star “*” have only been fine-tuned on MNLI, PLMs marked with two stars have been fine-tuned on both SNLI and MNLI). The models are provided via Huggingface (Wolf et al., 2019), three of them by textattack (Morris et al., 2020), and four by Cross Encoder (Reimers and Gurevych, 2019).

The models’ performances on MNLI, per our own evaluation (not all of the models provide evaluation scores, and we did not find precise documentation on how the scores were obtained), are given in table 1; for details of the evaluation, see the appendix, section B.

| Model name | Parameters | MNLI-Matched (acc.) |
|---|---|---|
| textattack-facebookbart-large-MNLI* | 406M | 0.8887 |
| nli-crossencoder-deberta-base** | 123M | 0.8824 |
| cross-encoder-nli-roberta-base** | 123M | 0.8733 |
| cross-encoder-nliMiniLM2-L6-H768** | 66M | 0.86602 |
| textattack-bert-base-uncased-MNLI* | 109M | 0.8458 |
| nli-crossencoder-distilroberta-base** | 82M | 0.8364 |
| textattack-distilbertbase-uncased-MNLI* | 66M | 0.8133 |

Table 1: Performance of the models in focus on the MNLI-Matched validation set. PLMs marked with one star “*” have only been fine-tuned on MNLI, PLMs marked with two stars have been fine-tuned on both SNLI and MNLI.

The basic idea behind the experiment is to assess whether the PLMs’ performance on our dataset reveals any shallow heuristics learned by the models during fine-tuning on MNLI and SNLI.

5 Results

The results of our experiments are shown in figure 1. For instance, the model whose performance is represented on the very left, textattack’s fine-tuned version of BART large, predicts the correct label in only 7% of cases for neutral samples, while doing so in 95% of cases for entailment samples and still 83% for contradiction samples.

Figure 1: Performance on our syllogistic dataset.

Figure 1 shows clearly that the models’ predictions are quite accurate for the labels entailment and contradiction, but very poor for neutral. In table 2, we therefore focus on the top three models’ performance for neutral samples. The table shows that the two smaller models have a preference for entailment, while the largest model tested has a slight preference for contradiction regarding the neutral samples. None of the models achieves an accuracy of more than 10% on these samples, which is far below pure guessing (33.3%).

| Model name | Predicted contradiction (%) | Predicted entailment (%) | Predicted neutral (%) |
|---|---|---|---|
| textattack-distilbertbase-uncased-MNLI | 23.68 | 68.01 | 8.31 |
| textattack-facebookbart-large-MNLI | 50.64 | 42.67 | 6.69 |
| cross-encoder-nliMiniLM2-L6-H768 | 35.41 | 64.49 | 0.1 |

Table 2: For true label neutral, this table gives the percentages of predicted labels for our three best-performing models. For instance, textattack’s distilbert erroneously predicts contradiction in 23.68% of all neutral pairs.

6 Discussion

Overall, textattack’s distilbert leads the field with an accuracy of 65%, which might be surprising given that it was among the smallest models evaluated here. However, there is growing evidence that NLI cannot be solved by simply increasing model size. Researchers at DeepMind find that larger models tend to generalize worse, not better, when it comes to tasks involving logical relationships. The large study by Rae et al. (2021, 23) strongly suggests that, in the words of the authors, “the benefits of scale are nonuniform”, and that logical and mathematical reasoning does not improve when scaling up to the gigantic size of Gopher, a model having 280B parameters (in contrast, Gopher sets a new SOTA on many other NLU tasks such as RACE-h and RACE-m, where it outperforms GPT-3 by some 25% in accuracy).

On the face of it, 65% looks like an excellent score for a model that has not been trained on the rather challenging dataset that we are using. However, when looking closer at the data, we find that the model performs very poorly with neutral samples; indeed, none of the models is able to recognize such neutral relationships with an accuracy of more than 10%. Given that pure chance would still yield an accuracy of some 33%, this is a very poor performance.
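How an overall score in this range can coexist with near-zero neutral accuracy is easy to see with a small calculation. The sketch below uses the three rounded per-class figures quoted for BART large in section 5 and assumes that the three labels are equally frequent, which follows from the dataset having 12 patterns per label; the resulting overall figure is therefore an approximation derived here, not a number reported in the results.

```python
# Rounded per-class accuracies for BART large as quoted in section 5.
per_class_accuracy = {"entailment": 0.95, "contradiction": 0.83, "neutral": 0.07}

# With equally frequent labels, overall accuracy is the mean of the per-class values.
overall = sum(per_class_accuracy.values()) / len(per_class_accuracy)
chance = 1 / len(per_class_accuracy)

print(f"overall: {overall:.1%}")
print(f"neutral: {per_class_accuracy['neutral']:.1%} (chance: {chance:.1%})")
```

Two strong classes are enough to lift the average above 60% even though the third class is handled far worse than by guessing.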

We have therefore further probed the heuristics that the models might be using that could cause the poor performance with neutral labels. Manual inspection showed that they respond strongly to symmetries regarding quantifiers and negations between premises and hypotheses. In particular, if either both or none of the premise and the hypothesis contain a “some” (existential quantifier) or a negation (the symmetric conditions), then the models are strongly biased to predict entailment (see figure 2).

Figure 2: Predicted labels for patterns that are symmetric between premise and hypothesis regarding existential quantifier and negation.

Conversely, if the pattern contains an asymmetry regarding existential quantifier and negation between premise and hypothesis, then the models are very strongly inclined to predict contradiction (see figure 3).

Figure 3: Predicted labels for patterns that are asymmetric between premise and hypothesis regarding existential quantifier and negation.
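The symmetry heuristic described above can be written down as a tiny rule-based classifier. This is a sketch of the hypothesized shortcut, not of what the models compute internally: the keyword lists, the function names, and the choice to require symmetry for both features at once are assumptions made for illustration.

```python
import re


def has_some(text):
    """True if the text contains the existential quantifier 'some'."""
    return re.search(r"\bsome\b", text, flags=re.IGNORECASE) is not None


def has_negation(text):
    """True if the text contains a negation word ('no' or 'not')."""
    return re.search(r"\b(no|not)\b", text, flags=re.IGNORECASE) is not None


def heuristic_label(premise, hypothesis):
    """Predict entailment for symmetric pairs, contradiction otherwise."""
    symmetric = (has_some(premise) == has_some(hypothesis)
                 and has_negation(premise) == has_negation(hypothesis))
    return "entailment" if symmetric else "contradiction"


ALL_PREMISE = ("All residents of California are residents of the USA, "
               "and all residents of Los Angeles are residents of California.")
SOME_PREMISE = ("All residents of California are residents of the USA, "
                "and some residents of Los Angeles are residents of California.")

# Gold label entailment: symmetric pair
print(heuristic_label(ALL_PREMISE, "All residents of Los Angeles are residents of the USA."))
# Gold label contradiction: negation only in the hypothesis
print(heuristic_label(ALL_PREMISE, "No residents of Los Angeles are residents of the USA."))
# Gold label neutral: 'some' only in the premise
print(heuristic_label(SOME_PREMISE, "All residents of Los Angeles are residents of the USA."))
```

On the three California examples from section 3.2, this rule gets the entailment and contradiction pairs right and labels the neutral pair as contradiction. Since it never outputs neutral at all, a model relying on something like it would show the pattern observed in our results.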

We have been surprised to see that the models are sensitive not only to symmetry regarding the negation particle, but also regarding the quantifier.

In the case of the contradiction and entailment pairs, these heuristics serve the models very well in our dataset, resulting in impressive performance. However, when applied to the neutral samples, the heuristics break down, and performance falls far below simple guessing.

The observations made regarding the MNLI dataset (see above, section 3.1), together with the overlap biases observed by McCoy et al. (2019), can to some extent help to explain the models’ behavior with neutral samples. Our observations regarding the features of the MNLI dataset, the alogical premises and the negation bias, suggest that a model fine-tuned on this dataset would struggle to identify neutral samples as such: if alogical premises are also part of contradiction and entailment pairs, and if negated sentences generally indicate contradiction, then the models would be expected to struggle to identify neutral samples that contain negation and that are somewhat difficult to identify as such even for humans.

Furthermore, as the hypotheses overlap almost entirely with the premises in our neutral samples, the biases observed by McCoy et al. (2019) would be expected to contribute to this failure to identify the neutral samples as such.

One could wonder whether these results are of any significance for the models investigated. One could argue that it is no surprise that the models perform poorly on samples that differ relevantly from the ones that they have seen during training. Hence, it is to be expected – and no reason for concern – that the models fail to perform well at these tasks.

In response to this, we would like to draw attention to the fact that the task is called natural language inference, and the concepts that the models are intended to learn are entailment and contradiction. If it turns out, as we suggest it has in our study and in the similar ones referenced above in section 2, that the models’ performance collapses even though the same logical relationships are in focus, while some superficial cues have disappeared, then it is wrong to say that the models are addressing the task of NLI, or that they are learning the logical concepts. Rather, the models are picking up spurious statistical cues that are correlated with the logical concepts in some dataset such as MNLI, but that are entirely unrelated to them in our dataset.

In other words, we suggest that the current lack of generalization beyond the training dataset that we can observe in our study (but which is also more widely acknowledged, see the references in section 1 and 2) is indeed a reason for concern. It implies that the models do not actually learn NLI but rather the exploitation of spurious statistical cues in the dataset, leading to shallow heuristics.

7 Conclusion

We have investigated the ability of state-of-the-art transformer-based PLMs fine-tuned on the common NLI datasets to recognize formal logical relationships in syllogistic patterns. Our results show that the models are very good at distinguishing entailment from contradiction, but very bad, much worse than chance, at distinguishing either from neutral. Our analysis has suggested that this is due to the PLMs’ use of shallow heuristics, in particular their attention to symmetries regarding negation and quantifiers between premises and hypotheses. We suggest that our study adds to the evidence that current transformer-based models do not actually learn NLI.


Appendix A Full Instructions Given to Crowdworkers

Williams et al. (2018, 1114) specify the following tasks for the crowdworkers:

“This task will involve reading a line from a non-fiction article and writing three sentences that relate to it. The line will describe a situation or event. Using only this description and what you know about the world:

  • Write one sentence that is definitely correct about the situation or event in the line.

  • Write one sentence that might be correct about the situation or event in the line.

  • Write one sentence that is definitely incorrect about the situation or event in the line.”

Appendix B Method used for evaluation of Models on MNLI

To evaluate the models, we have used Huggingface’s trainer API, see Huggingface (Wolf et al., 2019). In particular, we followed the instructions in the notebook here. We evaluated the models using the API out-of-the-box, with the following exceptions:

  1. The textattack-models had as labels "LABEL_0, LABEL_1, LABEL_2", which could not be read by the function that ensures that the labels are used equivalently by both model and dataset; hence, we reconfigured the models to use as labels “contradiction, entailment, neutral”.

  2. facebook-bart-large-mnli by textattack posed two additional challenges.

    1. Due to out of memory issues, we had to split up processing of the validation set into three chunks, averaging the accuracy received afterwards.

    2. The logits containing the predictions issued by facebook-bart-large-mnli could not be processed by the evaluation function, which caused the need to select only the first slice of the tensor that the model was issuing, ensuring that the metric function got a 1-dimensional tensor to compute accuracy.
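The adjustments listed above can be sketched in plain Python. This is an illustration under stated assumptions rather than the evaluation code itself: all function names are hypothetical, the order "contradiction, entailment, neutral" for LABEL_0 to LABEL_2 simply follows the order in which the labels are listed in item 1, and model outputs are represented as nested lists instead of tensors so that the example has no dependencies.

```python
def rename_labels(id2label, names=("contradiction", "entailment", "neutral")):
    """Map generic names such as LABEL_0 to readable ones (order is an assumption)."""
    return {i: names[int(label.split("_")[1])] for i, label in id2label.items()}


def first_slice(predictions):
    """If the model returns a tuple of outputs, keep only its first element."""
    return predictions[0] if isinstance(predictions, tuple) else predictions


def accuracy(predictions, references):
    """Share of rows whose highest-scoring class matches the reference label id."""
    logits = first_slice(predictions)
    predicted = [max(range(len(row)), key=row.__getitem__) for row in logits]
    return sum(p == r for p, r in zip(predicted, references)) / len(references)


def combine_chunk_accuracies(accuracies, sizes):
    """Size-weighted mean of per-chunk accuracies."""
    return sum(a * n for a, n in zip(accuracies, sizes)) / sum(sizes)


# Toy values for illustration only, not model outputs.
toy_predictions = ([[0.1, 0.7, 0.2], [0.9, 0.05, 0.05]], "additional model output")
print(rename_labels({0: "LABEL_0", 1: "LABEL_1", 2: "LABEL_2"}))
print(accuracy(toy_predictions, [1, 0]))
print(combine_chunk_accuracies([1.0, 0.5], [2, 2]))
```

A plain average of chunk accuracies equals the accuracy over the whole validation set only if the chunks have the same size; weighting by chunk size, as in `combine_chunk_accuracies`, gives the exact value in either case.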