Of late, large-scale pre-trained Transformer-based (vaswani-etal-2017-attention) models—such as RoBERTa (liu-et-al-2019-roberta), BART (lewis-etal-2020-bart), and GPT-2 and -3 (radford-etal-2019-language; brown-etal-2020-gpt3)—have exceeded recurrent neural networks’ performance on many NLU tasks (wang-etal-2018-glue; wang-etal-2019-superglue). In particular, several papers have suggested that Transformers pretrained on a language modeling (LM) objective capture syntactic information (goldberg2019assessing; hewitt-manning-2019-structural), at least to some reasonable extent (warstadt-etal-2019-investigating), and have shown that their self-attention layers are capable of surprisingly effective learning mechanisms (rogers2020). In this work, we raise questions about claims that current models “know syntax”.
Since there are many ways to investigate “syntax”, we must be clear on what we mean by the term. Clearly, “language is not merely a bag of words” (harris-1954-distributional, p.156). A natural and common perspective from many formal theories of linguistics (e.g., chomsky-1995-minimalist) is that knowing a natural language requires that you know the syntax of that language. Knowing the syntax of a sentence means being sensitive to (at least) the order of the words in that sentence. For example, it is well known that humans exhibit a “sentence superiority effect” (cattell-1886-time; scheerer1981early)—it is easier for us to identify or recall words presented in canonical orders than in disordered, ungrammatical sentences (toyota-2001-changes; baddeley-etal-2009-working; snell-grainger-2017-sentence; wen-etal-2019-parallel, i.a.). Generally, knowing the syntax of a sentence is taken to be a prerequisite for understanding what that sentence means (heim-kratzer-1998-semantics). If performing an NLU task actually requires a humanlike understanding of sentence meaning, and thus of syntax as we defined it, then NLU models should be sensitive to word order.
To investigate whether this is the case, we focus on textual entailment, one of the hallmark tasks used to measure the linguistic reasoning capacity of Natural Language Understanding (NLU) models (condoravdi-etal-2003-entailment; dagan-etal-2005-pascal). This task, often also called Natural Language Inference (NLI; bowman-etal-2015-large, i.a.), typically consists of two sentences, a premise and a hypothesis; the objective is to predict whether the premise entails the hypothesis, contradicts it, or is neutral with respect to it. We perform a battery of tests on models trained to perform NLI (Transformers pre-trained on LM, a CNN, and an RNN) in which we permute the original word order of examples such that no word remains in its original position, and the preservation of relative word order is minimized (see Table 1).
We find, somewhat surprisingly, that for nearly all premise-hypothesis pairs there are permutations that fool the models into providing the correct prediction. We verify our findings with a range of English NLI datasets, including SNLI (bowman-etal-2015-large), MultiNLI (williams-etal-2018-broad), and ANLI (nie-etal-2020-adversarial). We observe similar results on the Original Chinese NLI corpus (OCNLI; hu-etal-2020-ocnli), which makes it unlikely that our findings are English-specific. Our main contributions are:
We propose a metric suite (Permutation Acceptance) for evaluating NLU models’ insensitivity to unnatural word orders. (§ 3)
We construct permuted sets for multiple test dataset splits to investigate Permutation Acceptance, and measure NLI model performance on permuted sentences via several large scale tests (§ 5).
We perform an initial attempt to mitigate such issues in NLI models by devising a simple maximum entropy based method (§ 7).
We show that NLI models focus on words rather than word order. However, they seem to be able to partially reconstruct some syntax from permuted examples (§ 6), as we explore with metrics for word overlap and Part-of-Speech overlap.
Finally, we provide evidence that humans struggle to perform this task of unnatural language inference (§ 8).
2 Related Work
Researchers in NLP have studied syntactic structure in neural networks going back to tabor-1994-syntactic. Anecdotally, anyone who has interacted with large generative language models like GPT-2 or -3 will have marveled at their human-like ability to generate fluent and grammatical text (goldberg2019assessing; wolf2019some). When researchers have attempted to peek inside transformer LMs’ pretrained representations, familiar syntactic representations (hewitt-manning-2019-structural), or a familiar order of linguistic operations (tenney-etal-2019-bert), appear.
There is also evidence, notably from agreement attraction phenomena (linzen-etal-2016-assessing), that transformer-based models pretrained on an LM objective do acquire some knowledge of natural language syntax (gulordava-etal-2018-colorless; chrupala-alishahi-2019-correlating; jawahar-etal-2019-bert; lin-etal-2019-open; manning-etal-2020-emergent; hawkins-etal-2020-investigating; linzen-baroni-2021-syntactic). The claim that LMs acquire some syntactic knowledge has been made not only for transformers, but also for convolutional neural nets (bernardy-lappin-2017-using), and RNNs (gulordava-etal-2018-colorless; van-schijndel-linzen-2018-neural; wilcox-etal-2018-rnn; zhang-bowman-2018-language; prasad-etal-2019-using; ravfogel-etal-2019-studying)—although there are many caveats (e.g., ravfogel-etal-2018-lstm; white-etal-2018-lexicosyntactic; davis-van-schijndel-2020-recurrent; chaves-2020-dont; da-costa-chaves-2020-assessing; kodner-gupta-2020-overestimation).
Several works have debated the extent to which NLI models in particular know syntax (although each work adopts a slightly different idea of what “knowing syntax” entails). For example, mccoy-etal-2019-right argued that the knowledge acquired by models trained on NLI (for at least some popular datasets) is actually not as syntactically sophisticated as it might have initially seemed; some transformer models rely mainly on simpler, non-humanlike heuristics. In general, transformer LM performance has been found to be patchy and variable across linguistic phenomena (dasgupta-etal-2018-evaluating; naik-etal-2018-stress; an-etal-2019-representation; ravichander-etal-2019-equate; jeretic-etal-2020-natural). This is especially true for syntactic phenomena (marvin-linzen-2018-targeted; hu-etal-2020-systematic; gauthier-etal-2020-syntaxgym; mccoy-etal-2020-berts; warstadt-etal-2020-blimp), where transformers are, for some phenomena and settings, worse than RNNs (van-schijndel-etal-2019-quantity). From another angle, many have explored architectural approaches for increasing a network’s sensitivity to syntactic structure (chen-etal-2017-enhanced; Li-etal-2020-SANLI). williams-etal-2018-latent showed that models that learn jointly to perform NLI well and to parse do not generate parse trees that match popular syntactic formalisms. Furthermore, models trained explicitly to differentiate acceptable sentences from unacceptable ones (i.e., one of the most common syntactic tests used by linguists) have, to date, come nowhere near human performance (warstadt-etal-2019-neural).
Additionally, NLI models often over-attend to particular words to predict the correct answer (gururangan-etal-2018-annotation; clark-etal-2019-bert). wallace-etal-2019-universal show that some short sequences of non-human-readable text can fool many NLU models, including NLI models trained on SNLI, into predicting a specific label. In fact, ettinger-2020-whatbertisnot observed that for one of three test sets, BERT loses some accuracy on word-perturbed sentences, but that there exists a subset of examples for which BERT’s accuracy remains intact. This led ettinger-2020-whatbertisnot to speculate that “some of BERT’s success on these items may be attributable to simpler lexical or n-gram information”.
Thus, it is reasonable to wonder whether models can perform equally well on permuted sentences, since the model is able to view the same collection of words. If so, it suggests that these state-of-the-art models actually perform as bag-of-words models (blei-etal-2003-latent; mikolov2013efficient), i.e., they focus little on word order. That said, there are empirical reasons (in addition to the theoretical ones described above) to expect that knowledge of word order is crucial for performing NLI. An early human annotation effort on top of the PASCAL RTE dataset (dagan-etal-2006) discovered that “syntactic information alone is sufficient to make a judgment” on one third of the textual entailment examples in RTE, whereas almost half could be solved if a thesaurus was additionally provided (vanderwende2005syntax).
3 Syntactic Permutation Acceptance
As we mentioned, linguists generally take syntactic structure to be necessary for humans to determine the meaning of sentences. Many find the NLI task to be a very promising approximation of human natural language understanding, in part because it is rooted in the tradition of logical entailment. In the spirit of propositional logic, sentence meaning is taken to be truth-conditional (frege1948sense; montague-1970-universal; chierchia-mcconnell-1990-meaning; heim-kratzer-1998-semantics). That is to say, understanding a sentence is equivalent to knowing the actual conditions of the world under which the sentence would be (judged) true (wittgenstein-1922-tractatus). If grammatical sentences are required for sentential inference, as per a truth-conditional approach (montague-1970-universal), then permuted sentences should be meaningless. Put another way, the meanings of highly permuted sentences (if they exist) are not propositions, and thus do not have truth conditions. Only from the truth conditions of sentences can we tell whether they entail each other or not. In short, we should not expect the task of textual entailment to be defined at all in our “unnatural” case.
For our purposes, we hypothesize that a syntax-aware model might perform NLI in one of a few ways. First, if for every example in the dataset all of its permuted counterparts were hopelessly mixed up, the model might assign near-zero probability mass to the gold label (effectively recognizing the ungrammaticality). In this somewhat extreme case, no examples would have any of their permuted counterparts accepted (i.e., assigned the original example’s gold label) by the model. Second, permuted examples might instead baffle the model into assigning equal probability mass to all three labels, effectively choosing a label at random (this is inconsistent with our findings). Third, the model might just be unable to interpret permuted sentences at all. In that case, it might either always assign a particular label (e.g., neutral) or assign a non-entailment label (along the lines of 2-class textual entailment datasets like RTE; dagan-etal-2005-pascal); neither option is consistent with what we find. However, if the model does not care about word order and operates like a bag-of-words model, it would accept permuted examples at a high rate.
We show that none of the investigated models care about word order, according to our suite of Permutation Acceptance metrics, which quantify how many permuted sentences are accepted, i.e., assigned the gold label, by a model. In light of the “sentence superiority” findings, we argue that if the syntax is corrupted such that the resulting sentences are ungrammatical or meaningless, humans (and human-like NLU models) should achieve very low Permutation Acceptance scores. We show that humans indeed struggle with unnatural language inference data in § 8.
Constructing the permuted dataset.
Based on our notion of Permutation Acceptance, we devise a series of experiments using trained models on various NLI datasets. Concretely, for a given dataset $\mathcal{D}$ with splits $\mathcal{D}_{train}$ and $\mathcal{D}_{test}$, we first train an NLI model $M$ on $\mathcal{D}_{train}$ that achieves performance comparable to what was reported in the original papers. We then construct a randomized version of $\mathcal{D}_{test}$, which we term $\hat{\mathcal{D}}_{test}$, such that for each example $((p, h), y)$ (where $p$ and $h$ are the premise and hypothesis sentences of the example respectively and $y$ is the gold label), we use a permutation operator $\mathcal{F}$ that returns a list of $q$ permuted sentence pairs $(\hat{p}_i, \hat{h}_i)$, where $q$ is a hyperparameter. $\mathcal{F}$ permutes all positions of the words in a given sentence (i.e., either premise or hypothesis) with the restriction that no word remains in its original position. In our initial setting, we do not explicitly control the placement of the words relative to their original neighbors, but we analyze the effect of such word clumping in § 5.

Thus, $\hat{\mathcal{D}}_{test}$ now consists of $q \times |\mathcal{D}_{test}|$ examples, with $q$ different permutations of hypothesis and premise for each original test example pair. If a sentence contains $n$ words, the total number of available permutations with no word in its original position (derangements) is $!n \approx n!/e$, making the output of $\mathcal{F}$ a list of $q$ unique permutations drawn from this set.
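As a concrete illustration, the permutation operator can be sketched as rejection sampling over derangements (permutations with no fixed point). This is a minimal hypothetical implementation, not the authors' released code; the names `derange` and `permute_example` are ours.

```python
import random

def derange(tokens, rng):
    """Sample a permutation of `tokens` in which no token keeps its
    original position (a derangement), via rejection sampling."""
    if len(tokens) < 2:
        raise ValueError("need at least 2 tokens to derange")
    idx = list(range(len(tokens)))
    while True:
        shuffled = idx[:]
        rng.shuffle(shuffled)
        # accept only if no position maps to itself
        if all(i != j for i, j in zip(idx, shuffled)):
            return [tokens[j] for j in shuffled]

def permute_example(premise, hypothesis, q=100, seed=0):
    """Return q permuted (premise, hypothesis) pairs for one example."""
    rng = random.Random(seed)
    return [(" ".join(derange(premise.split(), rng)),
             " ".join(derange(hypothesis.split(), rng)))
            for _ in range(q)]
```

Rejection sampling is efficient here: the fraction of permutations that are derangements converges to $1/e$, so only about $e \approx 2.7$ draws are needed on average per sentence.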
Defining Permutation Acceptance.
The choice of $q$ naturally allows us to take a statistical view of a model's predictability on the permuted sentences. To that end, we define the following notational conventions. Let $\mathcal{A}$ be the original accuracy of a given model $M$ on a dataset $\mathcal{D}$, and $c$ be the number of examples in $\mathcal{D}$ which are marked as correct according to the standard notion of dataset accuracy. Typically $\mathcal{A}$ is given by $\frac{c}{|\mathcal{D}|}$ or $\frac{c}{|\mathcal{D}|} \times 100$, where $c$ is the number of examples predicted correctly with respect to the ground truth.

Let $P_c$ be the percentage of permutations of an example deemed correct (i.e., assigned the ground-truth label) by the model $M$:

$$P_c(p, h) = \frac{100}{q} \sum_{i=1}^{q} \mathbb{1}\left[ M(\hat{p}_i, \hat{h}_i) = y \right]$$

Let $\Omega_x$ be the percentage of examples for which $P_c$ exceeds a threshold $x$. Concretely, a given $p$ and $h$ will count as being predicted correctly according to $\Omega_x$ if more than $x$ percent of their permutations $(\hat{p}_i, \hat{h}_i)$ are assigned the gold label by the model $M$. Mathematically,

$$\Omega_x = \frac{100}{|\mathcal{D}|} \sum_{(p, h) \in \mathcal{D}} \mathbb{1}\left[ P_c(p, h) > x \right]$$

There are two specific cases of $\Omega_x$ that we are most interested in. First, we define $\Omega_{max}$, or Maximum Accuracy, where $x$ is set such that a single accepted permutation suffices. In short, $\Omega_{max}$ gives the percentage of examples for which the model $M$ assigns the gold label $y$ to at least one of the $q$ permutations. Second, we define $\Omega_{rand}$, or random-baseline accuracy, where $x = 33.3$, the chance probability (for balanced 3-way classification). This metric is more stringent than $\Omega_{max}$, and provides a lower-bound relaxation of it.

We also define $\mathcal{D}_f$ to be the list of examples originally marked incorrect according to $\mathcal{A}$ but now deemed correct according to $\Omega_{max}$ (hence, flipped), and $\mathcal{D}_c$ to be the list of examples originally marked correct according to $\mathcal{A}$. Thus, $\mathcal{D}_f \cap \mathcal{D}_c = \varnothing$ necessarily. Additionally, we define $\bar{P}_c$ and $\bar{P}_f$, each ranging from 0 to 100, as the dataset-average percentage of permuted examples deemed correct when the examples were originally correct ($\bar{P}_c$) and when the examples were originally incorrect ($\bar{P}_f$) as per $\mathcal{A}$.
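The metrics above can be sketched directly from a boolean prediction matrix. The following is an illustrative implementation under our notation, not the paper's released evaluation code:

```python
from statistics import mean

def permutation_acceptance(pred_correct):
    """Permutation Acceptance metrics from a matrix in which
    pred_correct[i][j] is True iff the model assigned example i's
    gold label to its j-th permutation."""
    # P_c per example: percentage of its permutations deemed correct
    Pc = [100.0 * sum(row) / len(row) for row in pred_correct]
    # Omega_x: percentage of examples whose P_c exceeds threshold x
    omega = lambda x: 100.0 * mean(1 if pc > x else 0 for pc in Pc)
    return {"Omega_max": omega(0.0),         # at least one permutation accepted
            "Omega_rand": omega(100.0 / 3),  # above 3-way chance
            "Pc_mean": mean(Pc)}
```

For instance, an example with 2 of 3 permutations accepted counts toward both $\Omega_{max}$ and $\Omega_{rand}$, while an example with none accepted counts toward neither.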
Since the permutation function $\mathcal{F}$ results in ungrammatical, nonsensical sentences (Table 1), we have two hypotheses given our discussion so far: (a) permutations produced by $\mathcal{F}$ will elicit random outcomes, with $\bar{P}_c$ and $\bar{P}_f$ hovering around the uniform chance probability of 33.3%; or (b) permutations produced by $\mathcal{F}$ will receive incorrect predictions, resulting in $\Omega_{max} \approx 0$ and $\bar{P}_c \approx \bar{P}_f \approx 0$. However, we observe neither (a) nor (b) with the state-of-the-art models.
We present results for two types of models: (a) Transformer-based models and (b) non-Transformer models. Among Transformer-based models, we investigate state-of-the-art pre-trained models such as RoBERTa (large) (liu-et-al-2019-roberta) and BART (large) (lewis-etal-2020-bart), as well as the relatively small DistilBERT model (sanh2020distilbert). For (b) we consider several pre-Transformer-era recurrent and convolution-based neural networks, namely InferSent (conneau2017supervised), a bidirectional LSTM (collobert2008unified), and a ConvNet (zhao2015self). We train all models on MNLI (williams-etal-2018-broad), and evaluate on in-distribution (SNLI (bowman-etal-2015-large) and MNLI) and out-of-distribution (ANLI (nie-etal-2020-adversarial)) datasets. We independently verified Transformer-based results on our trained models using HuggingFace Transformers (wolf2020transformers), as well as pre-trained checkpoints from FairSeq (ott2019fairseq) via the PyTorch Model Hub. For non-Transformer models, we use the codebase from conneau2017supervised. We use $q = 100$, with fixed seeds for the randomizations of each example in $\hat{\mathcal{D}}_{test}$ to ensure full reproducibility. We drop examples from test sets for which we are unable to compute $q$ unique randomizations; typically these are examples with sentences of fewer than 6 tokens. Code, randomized data, and model checkpoints will be released publicly.
Models accept many permuted examples.
We find that $\Omega_{max}$ for models trained on MNLI and evaluated on MNLI (in-domain generalization) is strikingly high: 98.7% on the MNLI dev and test sets. This shows that for almost all examples in $\hat{\mathcal{D}}_{test}$ there exists at least one permutation for which the model predicts the correct answer. We also observe a significantly high $\Omega_{rand}$ of 79.4%, suggesting the models outdo even a random baseline in accepting permuted, ungrammatical sentences.
Furthermore, we observe similar effects of $\Omega_{max}$ for out-of-domain generalization when evaluating on the ANLI dataset splits, where $\Omega_{max}$ is significantly higher than $\mathcal{A}$. As a consequence, we encounter many flips, where an example was originally predicted incorrectly by the model but at least one permutation of that example elicits the correct response. However, recall that this analysis presupposes knowledge of the correct label, so the test can be thought of as running a syntax-based stress test on the model until we reach the correct answer (or give up after exhausting our $q$ permutations).

In the case of out-of-domain generalization, $\Omega_{rand}$ drops considerably. The probability of a permuted sentence being predicted correctly is also significantly higher for examples that were originally predicted correctly ($\bar{P}_c > \bar{P}_f$ for all test splits). These two results suggest that the investigated NLU models act as bag-of-words models, for which it is harder to find a correct permutation of already misclassified, non-generalizable sentences.
Models are very confident.
The phenomenon we observe would be of less concern if the correct label predictions were just an outcome of chance, which would be the case if the entropy of the model's output log probabilities were high (suggesting near-uniform probabilities on the entailment, neutral, and contradiction labels). We first investigate the model probabilities for the Transformer-based models on the permutations that lead to the correct answer in Figure 1. We find overwhelming evidence that model confidences on in-distribution datasets (MNLI, SNLI) are highly skewed, resulting in low entropy, though the degree of skew varies among model types; BART proves to be the most skewed of the three models we consider.

To investigate whether the skewedness is a function of model capacity, we examine the log probabilities of a lower-capacity model, DistilBERT, on random perturbations, and find them very similar to those of the RoBERTa (large) model, although DistilBERT does exhibit lower $\Omega_{max}$, $\Omega_{rand}$, and $\bar{P}_c$, suggesting an inability to fully process certain premise-hypothesis pairs due to its lower capacity.
For the non-Transformer models, whose accuracy is lower, we report relative performance in terms of $\Omega_{max}$ (Table 3) and average entropy (Figure 1). As expected, since the non-Transformer models are significantly worse overall, the $\Omega_{max}$ achieved by these models is also lower. However, comparing the averaged entropy of the model predictions, it is clear that there is some benefit to being a worse model: these models are not as overconfident on randomized sentences as the Transformers are.
Similar artifacts in Chinese NLU.
To verify that the observation is not specific to English NLI, we extended the experiments to the Original Chinese NLI dataset (hu-etal-2020-ocnli, OCNLI). We trained the same RoBERTa (large) and InferSent (non-Transformer) architectures on OCNLI. We find similar results (Table 4), suggesting that the phenomenon is not just an artifact of English text, but bears on natural language understanding as a whole.
6 Analyzing Syntactic Structure Associated with Tokens
Since all investigated models have relatively high Permutation Acceptance, we are led to ask: what properties of particular permutations lead them to be accepted? We perform two initial analyses to shed light on this question. First, we ask to what extent preserving local word order despite permutation is correlated with higher Permutation Acceptance scores. We find that there is some correlation, but it does not fully explain the high Permutation Acceptance scores. Second, we ask whether acceptance is related to a more abstract measure of local word relations, i.e., part-of-speech (POS) neighborhood. We find little effect of POS neighbors for non-Transformer models, but RoBERTa, BART, and DistilBERT show a distinct effect. Taken together, these analyses suggest that some local word-order information affects models’ Permutation Acceptance scores, and that methods for decreasing model reliance on this information could be fruitful.
Preserving Local Word Order Leads to Higher Permutation Acceptance.
In our initial experiments, we randomized the word order of the sentences with the constraint that no word appear in its original position. This kind of randomization can still preserve the relative positions of some n-grams. To analyze this effect, we compute BLEU scores over 2-, 3-, and 4-grams and compare the acceptability of the permuted sentences across different models. If preserved n-grams are driving the Permutation Acceptance effects, we should see a correlation between BLEU scores and acceptability. As a result of our permutation process, the maximum BLEU-3 and BLEU-4 scores are negligibly low (< 0.2 for BLEU-3 and < 0.1 for BLEU-4; see Appendix D), already calling into question the hypothesis that n-grams are the sole explanation for our finding. Because of this, we only compare BLEU-2 scores. (Detailed experiments on specially constructed permutations that cover the entire range of BLEU-3 and BLEU-4 are provided in Appendix D.) We find that the probability of a permuted sentence being predicted correctly by the model correlates with its BLEU-2 score (Figure 2). However, the base prediction rate of Transformer-based models is still far from random (66% even in the lowest BLEU-2 range), and hence the language processing mechanisms employed by these models require further investigation.
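Because our permutations preserve sentence length and word identity, unigram precision is always 1 and there is no brevity penalty, so BLEU-2 here effectively reduces to clipped bigram precision. A minimal sketch of that computation (our own helper, assuming pre-tokenized input):

```python
from collections import Counter

def bleu2(reference, candidate):
    """Clipped bigram precision between a reference (original) and a
    candidate (permuted) token list -- a simplified stand-in for BLEU-2
    when the candidate is a permutation of the reference."""
    ref_bi = Counter(zip(reference, reference[1:]))
    cand_bi = Counter(zip(candidate, candidate[1:]))
    if not cand_bi:
        return 0.0
    # clip each candidate bigram count by its count in the reference
    overlap = sum(min(c, ref_bi[b]) for b, c in cand_bi.items())
    return overlap / sum(cand_bi.values())
```

For example, the permutation "c d a b" of "a b c d" preserves the bigrams (c, d) and (a, b) out of three candidate bigrams, giving a score of 2/3.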
Part-of-speech neighborhood tracks Permutation Acceptance.
Many syntactic formalisms, like Lexical Functional Grammar (kaplan-bresnan-1995-formal; bresnan-etal-2015-lexical, LFG), Head-driven Phrase Structure Grammar (pollard-sag-1994-head, HPSG), or Lexicalized Tree Adjoining Grammar (schabes-etal-1988-parsing; abeille-1990-lexical, LTAG), are “lexicalized”, i.e., individual words or morphemes bear syntactic features telling you which other things they can combine with. Taking LTAG as an example, two lexicalized trees would be associated with the verb “buy”: one which projects a phrase-structure minitree containing two noun phrases (for the subject and direct object, as in Kim bought a book), and one which projects a minitree containing three (one for the subject, another for the direct object, and a third for the indirect object, as in Kim bought Logan a book). In this way, an average tree family for any particular word would provide information about what sort of syntactic contexts the word generally appears in, or roughly which syntactic neighbors the word has. Taking inspiration from lexicalized grammatical formalisms, we speculate that our NLI models might be performing well on permuted examples because they are reconstructing, perhaps noisily, the word order of a sentence from its words.
To test this, we operationalized the idea of a lexicalized minitree. First, we POS tagged every example in the corpus using the 17 Universal Part-of-Speech tags (using spaCy; spacy2). For each word $w$, we compute the occurrence probabilities of POS tags on tokens in the neighborhood of $w$, for each sentence $s$ in $\mathcal{D}_{train}$ containing $w$. The neighborhood is specified by a radius $r$ (a symmetrical window of $r$ tokens to the left and right of $w$). We denote this sentence-level probability distribution of POS tags for a word $w$ as $\pi_s(w)$ (see Figure 6). These sentence-level POS-neighbor distributions can be averaged to get a corpus-level POS tag minitree probability $\pi_{train}(w)$ (i.e., a type-level score). Then, for a sentence $s$ and each word $w \in s$, we compute a POS minitree overlap score as follows:

$$\theta_w = \frac{1}{k} \left| \mathrm{top}_k\big(\pi_s(w)\big) \cap \mathrm{top}_k\big(\pi_{train}(w)\big) \right|$$

Concretely, $\theta_w$ computes the overlap of the top-$k$ POS tags in the neighborhood of a word with those of the train statistic. If a word has the same minitree in the permuted sentence as in the training set, the overlap is 1. For a given sentence $s$, the aggregate $\theta_s$ is defined as the average of the overlap scores of the constituent words, $\theta_s = \frac{1}{|s|} \sum_{w \in s} \theta_w$, and we call it a POS minitree signature.

We can compute the same score for a permuted sentence $\hat{s}$ to obtain $\theta_{\hat{s}}$. The idea is that if the permuted sentence's POS signature comes close to that of the true sentence, the ratio $\theta_{\hat{s}} / \theta_s$ will be close to 1; if the ratio exceeds 1, the permuted sentence overlaps with the train statistic more than the original sentence does. Put another way, if high overlap correlates with the percentage of permutations deemed correct (even in randomized sentences), then our models treat words as if they bear syntactic minitrees. Therefore, for examples where many of the permutations have a high average POS minitree overlap score, we should expect a higher prediction accuracy.
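The overlap computation can be sketched as follows, assuming the per-word POS-neighborhood distributions have already been estimated; the dictionary format and the default $k = 4$ are illustrative assumptions, not the paper's exact configuration:

```python
def minitree_overlap(tags_sentence, tags_train, k=4):
    """Overlap of the top-k POS tags in a word's neighborhood within one
    sentence against the word's aggregate top-k tags from the training
    set. Both arguments map POS tag -> probability for a single word."""
    top = lambda dist: {t for t, _ in
                        sorted(dist.items(), key=lambda kv: -kv[1])[:k]}
    return len(top(tags_sentence) & top(tags_train)) / k

def pos_signature(per_word_sentence, per_word_train, k=4):
    """Sentence-level signature: average overlap over constituent words."""
    scores = [minitree_overlap(dist, per_word_train.get(w, {}), k)
              for w, dist in per_word_sentence.items()]
    return sum(scores) / len(scores)
```

With $k = 4$, a word sharing three of its four most likely neighborhood tags with the training statistic scores 0.75, as in the "river" example of Figure 6.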
We plot the percentage of permuted sentences accepted against the POS signature ratio in Figure 3. We observe that the POS Tag Minitree hypothesis holds for the Transformer-based models (RoBERTa, BART, and DistilBERT): the percentage of accepted pairs increases as the permuted sentences overlap more with the un-permuted sentence in terms of POS signature. For non-Transformer models such as InferSent, ConvNet, and BiLSTM, the percentage of correct permutations remains flat or decreases as the POS signature ratio grows, suggesting that the reasoning process employed by these models does not preserve local abstract syntactic structure (i.e., POS-neighbor relations).
7 Maximum Entropy Training
Here, we propose an initial attempt to mitigate the effect of correct prediction on permuted examples. We observed that the output log probabilities of a model on permuted examples are significantly higher than random. A similar phenomenon has been observed in computer vision (gandhi2019mutual), and suggests that models struggle to learn mutual exclusivity: neural networks tend to output higher-than-random confidence even for unknown inputs, which might be an underlying cause of the high Permutation Acceptance of our NLI models.

Since our ideal model would be ambivalent about randomized, ungrammatical sentences, we devise a simple objective for training NLU models that bakes in the mutual exclusivity principle by maximizing entropy. Concretely, we train a RoBERTa model to do well on the MNLI dataset while maximizing the entropy $H$ over a subset of $n$ randomized examples $(\hat{p}_i, \hat{h}_i)$ for each example $((p, h), y)$. We use $n$ randomizations per example, and modify the loss function as follows:

$$\mathcal{L} = \underset{\theta}{\arg\min} \sum_{((p,h),y)} \Big[ -\log p\big(y \mid (p, h); \theta\big) - \sum_{i=1}^{n} H\big(y \mid (\hat{p}_i, \hat{h}_i); \theta\big) \Big]$$

Using this simple maximum-entropy method, we find that the model becomes considerably more robust to randomized sentences (Figure 4), all without taking a hit in accuracy (Table 5). We observe that none of the models reach a score close to 0, suggesting further room to explore other methods for decreasing models’ Permutation Acceptance.
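The modified objective can be sketched for a single example as follows. This is a pure-Python illustration of the loss shape, not the actual training code; the `lam` weighting knob is our addition (the formulation above weights both terms equally, i.e., `lam=1.0`):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]  # shift for numerical stability
    z = sum(exps)
    return [e / z for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def max_entropy_loss(orig_logits, gold, permuted_logits, lam=1.0):
    """Cross-entropy on the original example minus the mean entropy of
    the model's predictions on its permuted copies, so that minimizing
    the loss pushes permuted predictions toward uniform (high entropy)."""
    ce = -math.log(softmax(orig_logits)[gold])
    h = sum(entropy(softmax(l)) for l in permuted_logits) / len(permuted_logits)
    return ce - lam * h
```

A confident, correct prediction on the original pair combined with near-uniform predictions on its permutations drives the loss down, which is exactly the behavior the objective rewards.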
8 Human Evaluation
| Evaluator | Accuracy | Macro F1 | Acc. on originally correct | Acc. on originally incorrect |
Since our models often accept permuted sentences, we ask how humans perform unnatural language inference on permuted sentences. We expect humans to struggle with the task, given our intuitions and the sentence-superiority findings. To test this, we presented two experts in Natural Language Inference (one a linguist) with a random sample of 200 permuted sentence pairs, and asked them to predict the entailment relation. The experts were provided with no information about the examples from which the permutations were drawn (beyond the common knowledge that NLI is usually defined as a roughly balanced 3-way classification task). Unbeknownst to the experts, all permuted sentences in the sample had actually been accepted by a RoBERTa model trained on MNLI. We observe that the experts performed much worse than RoBERTa (Table 6), although their accuracy was somewhat higher than random. In a second sample, again unbeknownst to the experts, we provided permuted sentence pairs from the MNLI Matched Dev set, some of which were originally predicted correctly and some incorrectly by the RoBERTa (large) model. We find that for both experts, accuracy on permutations from originally correct examples was higher than on the originally incorrect ones, which corroborates prior findings (dasgupta-etal-2018-evaluating; gururangan-etal-2018-annotation; naik-etal-2019-exploring) that word overlap is an important factor.
9 Future Work & Conclusion
While we have shown that classification labels can be flipped based solely on a reordering of a sentence, future work could also explore the relationship between permutation and deletion. Although our results tentatively support the hypothesis that current models do not “know syntax” in a human-like way (according to our definition) and are mostly just sensitive to words (or perhaps n-grams), they are preliminary, and future work is required to fully understand human-like classification of permuted NLI examples.
In this work we show that state-of-the-art NLU models do not rely on sentence structure the way we think they should. On the task of Natural Language Inference, we show that models (Transformer-based models, RNNs, and ConvNets alike) are largely insensitive to permutations of word order that corrupt the original syntax. This raises questions about the extent to which such systems understand “syntax”, and highlights the unnatural language understanding processes they employ.
A few years ago, manning-etal-2015-computational encouraged NLP to consider “the details of human language, how it is learned, processed, and how it changes, rather than just chasing state-of-the-art numbers on a benchmark task.” We expand upon this view, and suggest one particular future direction: we should train models not only to do well, but also not to overgeneralize to corrupted input.
Thanks to Shagun Sodhani, Hagen Blix, Ryan Cotterell, Emily Dinan, Nikita Nangia, Grusha Prasad, and Roy Schwartz for many invaluable comments and feedback on early drafts.
Appendix A Effect of Length on Permutation Acceptance
We investigate the effect of length on Permutation Acceptance in Figure 5. We observe that shorter sentences generally have a higher probability of acceptance for examples that were originally predicted correctly, since shorter sentences have fewer unique permutations. However, for examples that were originally incorrect, the trend is absent.
Appendix B Example of POS Minitree
As we defined in § 6, we develop a POS signature for each word of each sentence in a test set, comparing it with the distribution of the same word in the training set. Figure 6 provides a snapshot of the word “river” from the test set and how the POS signature distribution of the word in a particular sentence matches the aggregated training statistic. In practice, we take the top-$k$ tags for the word in the test signature as well as the train signature, and calculate the overlap of POS tags. When comparing model performance on permuted sentences, we compute the ratio between this overlap score and the overlap score of the permuted sentence. In Figure 6, “river” has a POS tag minitree score of 0.75.
Appendix C Effect of Hypothesis only randomization
In recent years, the impact of the hypothesis sentence (gururangan-etal-2018-annotation; tsuchiya-2018-performance; poliak-etal-2018-hypothesis) on NLI classification has been a topic of much interest. As we noted in § 3, logical entailment can only be defined for pairs of propositions. We investigated the effect of randomizing only the hypothesis sentences while keeping the premise intact. We find (Figure 8) that $\Omega_{max}$ is almost identical for the two schemes, suggesting that even with only the hypothesis randomized, the model exhibits a similar phenomenon.
Appendix D Effect of clumped words in random permutations
Since our original permuted dataset consists of extremely randomized word orders, we observe very low BLEU-3 (< 0.2) and BLEU-4 (< 0.1) scores. To study the effect of overlap across a wider range of permutations, we devised an experiment in which we clump certain words together before performing the random permutation. Concretely, we clump 25%, 50%, or 75% of the words in a sentence and then permute the remaining words together with the clump as a whole unit. This type of clumped permutation allows us to study the full range of BLEU-2/3/4 scores, which we present in Figure 9. As expected, the acceptability of permuted sentences increases linearly with BLEU score overlap.
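A sketch of the clumping step, under the simplifying assumption that the clump is a single contiguous span of the sentence (the exact clumping scheme is not fully specified here):

```python
import random

def clumped_permute(tokens, frac=0.5, seed=0):
    """Glue a contiguous `frac` fraction of the tokens into one unit,
    then shuffle that unit together with the remaining single tokens,
    so word order inside the clump is preserved."""
    rng = random.Random(seed)
    n = max(1, int(len(tokens) * frac))
    start = rng.randrange(len(tokens) - n + 1)
    clump = tokens[start:start + n]
    rest = tokens[:start] + tokens[start + n:]
    units = [clump] + [[t] for t in rest]
    rng.shuffle(units)
    return [t for unit in units for t in unit]
```

Larger `frac` values preserve more n-grams, which is what lets this construction span the higher BLEU-2/3/4 ranges that pure derangements cannot reach.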
Appendix E Effect of the threshold $x$ in various test splits
We defined two variations of $\Omega_x$, namely $\Omega_{max}$ and $\Omega_{rand}$, but it is possible to define any arbitrary threshold percentage $x$ to evaluate the unnatural language inference mechanisms of different models. In Figure 7 we show the effect of different thresholds, including those underlying $\Omega_{max}$ and $\Omega_{rand}$. We observe that for in-distribution datasets (top row, MNLI and SNLI splits), even in the extreme setting where $x$ approaches 100%, more than 10% of examples remain, and more than 25% in the case of InferSent and DistilBERT. For out-of-distribution datasets (bottom row, ANLI splits) we observe a much lower trend, suggesting that generalization itself is the bottleneck in permuted-sentence understanding.