Unnatural Language Inference

12/30/2020 ∙ by Koustuv Sinha, et al.

Natural Language Understanding has witnessed a watershed moment with the introduction of large pre-trained Transformer networks. These models achieve state-of-the-art performance on various tasks, notably including Natural Language Inference (NLI). Many studies have shown that the large representation space learned by these models encodes some syntactic and semantic information. However, to really "know syntax", a model must recognize when its input violates syntactic rules and adjust its inferences accordingly. In this work, we find that state-of-the-art NLI models, such as RoBERTa and BART, are invariant to, and sometimes even perform better on, examples with randomly reordered words. With iterative search, we are able to construct randomized versions of NLI test sets, which contain permuted hypothesis-premise pairs with the same words as the originals, yet are classified with perfect accuracy by large pre-trained models as well as pre-Transformer state-of-the-art encoders. We find the issue to be language- and model-invariant, and hence investigate its root cause. To partially alleviate this effect, we propose a simple training methodology. Our findings call into question the idea that our natural language understanding models, and the tasks used for measuring their progress, genuinely require a human-like understanding of syntax.







1 Introduction

Of late, large-scale pre-trained Transformer-based (vaswani-etal-2017-attention) models—such as RoBERTa (liu-et-al-2019-roberta), BART (lewis-etal-2020-bart), and GPT-2 and -3 (radford-etal-2019-language; brown-etal-2020-gpt3)—have exceeded recurrent neural networks’ performance on many NLU tasks (wang-etal-2018-glue; wang-etal-2019-superglue). In particular, several papers have even suggested that Transformers pretrained on a language modeling (LM) objective capture syntactic information (goldberg2019assessing; hewitt-manning-2019-structural), at least to some reasonable extent (warstadt-etal-2019-investigating), and have shown that their self-attention layers are capable of surprisingly effective learning mechanisms (rogers2020). In this work, we raise questions about claims that current models “know syntax”.

Gold Label | Premise | Hypothesis
E | Boats in daily use lie within feet of the fashionable bars and restaurants. | There are boats close to bars and restaurants.
E | restaurants and use feet of fashionable lie the in Boats within bars daily . | bars restaurants are There and to close boats .
C | He and his associates weren’t operating at the level of metaphor. | He and his associates were operating at the level of the metaphor.
C | his at and metaphor the of were He operating associates n’t level . | his the and metaphor level the were He at associates operating of .

Table 1: Examples from the MNLI Matched development set. Both the original and the permuted example elicit the same classification label (Entailment and Contradiction, respectively) from RoBERTa (large). We provide a simple demo of this behaviour in the associated Google Colab notebook.

Since there are many ways to investigate “syntax”, we must be clear on what we mean by the term. Clearly, “language is not merely a bag of words” (harris-1954-distributional, p.156). A natural and common perspective from many formal theories of linguistics (e.g., chomsky-1995-minimalist) is that knowing a natural language requires that you know the syntax of that language. Knowing the syntax of a sentence means being sensitive to (at least) the order of the words in that sentence. For example, it is well known that humans exhibit a “sentence superiority effect” (cattell-1886-time; scheerer1981early)—it is easier for us to identify or recall words presented in canonical orders than in disordered, ungrammatical sentences (toyota-2001-changes; baddeley-etal-2009-working; snell-grainger-2017-sentence; wen-etal-2019-parallel, i.a.). Generally, knowing the syntax of a sentence is taken to be a prerequisite for understanding what that sentence means (heim-kratzer-1998-semantics). If performing an NLU task actually requires a humanlike understanding of sentence meaning, and thus of syntax as we defined it, then NLU models should be sensitive to word order.

To investigate if this is the case, we focus on textual entailment, one of the hallmark tasks used to measure the linguistic reasoning capacity of Natural Language Understanding (NLU) models (condoravdi-etal-2003-entailment; dagan-etal-2005-pascal). This task, often also called Natural Language Inference (NLI; bowman-etal-2015-large, i.a.), typically presents two sentences, a premise and a hypothesis; the objective is to predict whether the premise entails the hypothesis, contradicts it, or is neutral with respect to it. We perform a battery of tests on models trained to perform NLI (Transformers pre-trained with an LM objective, a CNN, and an RNN) in which we permute the original word order of examples such that no word remains in its original position and the overlap in relative word order is minimized (see Table 1).

We find, somewhat surprisingly, that for nearly all premise-hypothesis pairs there are permutations that fool the models into providing the correct prediction. We verify our findings with a range of English NLI datasets, including SNLI (bowman-etal-2015-large), MultiNLI (williams-etal-2018-broad) and ANLI (nie-etal-2020-adversarial). We observe similar results on the Original Chinese NLI corpus (OCNLI; hu-etal-2020-ocnli), which makes it unlikely that our findings are English-specific. Our main contributions are:

  • We propose a metric suite (Permutation Acceptance) for evaluating NLU models’ insensitivity to unnatural word orders. (§ 3)

  • We construct permuted sets for multiple test dataset splits to investigate Permutation Acceptance, and measure NLI model performance on permuted sentences via several large-scale tests (§ 5).

  • We perform an initial attempt to mitigate such issues in NLI models by devising a simple maximum entropy based method (§ 7).

  • We show that NLI models focus on words rather than word order. However, they seem to be able to partially reconstruct some syntax from permuted examples (§ 6), as we explore with metrics for word overlap and Part-of-Speech overlap.

  • Finally, we provide evidence that humans struggle to perform this task of unnatural language inference (§ 8).

2 Related Work

Researchers in NLP have studied syntactic structure in neural networks going back to tabor-1994-syntactic. Anecdotally, anyone who has interacted with large generative language models like GPT-2 or -3 will have marveled at their human-like ability to generate fluent and grammatical text (goldberg2019assessing; wolf2019some). When researchers have attempted to peek inside transformer LMs’ pretrained representations, familiar syntactic representations (hewitt-manning-2019-structural), or a familiar order of linguistic operations (tenney-etal-2019-bert), appear.

There is also evidence, notably from agreement attraction phenomena (linzen-etal-2016-assessing), that transformer-based models pretrained on an LM objective do acquire some knowledge of natural language syntax (gulordava-etal-2018-colorless; chrupala-alishahi-2019-correlating; jawahar-etal-2019-bert; lin-etal-2019-open; manning-etal-2020-emergent; hawkins-etal-2020-investigating; linzen-baroni-2021-syntactic). The claim that LMs acquire some syntactic knowledge has been made not only for transformers, but also for convolutional neural nets (bernardy-lappin-2017-using), and RNNs (gulordava-etal-2018-colorless; van-schijndel-linzen-2018-neural; wilcox-etal-2018-rnn; zhang-bowman-2018-language; prasad-etal-2019-using; ravfogel-etal-2019-studying)—although there are many caveats (e.g., ravfogel-etal-2018-lstm; white-etal-2018-lexicosyntactic; davis-van-schijndel-2020-recurrent; chaves-2020-dont; da-costa-chaves-2020-assessing; kodner-gupta-2020-overestimation).

Several works have debated the extent to which NLI models in particular know syntax (although each work adopts a slightly different idea of what “knowing syntax” entails). For example, mccoy-etal-2019-right argued that the knowledge acquired by models trained on NLI (for at least some popular datasets) is actually not as syntactically sophisticated as it might have initially seemed; some transformer models rely mainly on simpler, non-humanlike heuristics. In general, transformer LM performance has been found to be patchy and variable across linguistic phenomena (dasgupta-etal-2018-evaluating; naik-etal-2018-stress; an-etal-2019-representation; ravichander-etal-2019-equate; jeretic-etal-2020-natural). This is especially true for syntactic phenomena (marvin-linzen-2018-targeted; hu-etal-2020-systematic; gauthier-etal-2020-syntaxgym; mccoy-etal-2020-berts; warstadt-etal-2020-blimp), where transformers are, for some phenomena and settings, worse than RNNs (van-schijndel-etal-2019-quantity). From another angle, many have explored architectural approaches for increasing a network’s sensitivity to syntactic structure (chen-etal-2017-enhanced; Li-etal-2020-SANLI). williams-etal-2018-latent showed that models that learn jointly to perform NLI well and to parse do not generate parse trees that match popular syntactic formalisms. Furthermore, models trained explicitly to differentiate acceptable sentences from unacceptable ones (i.e., one of the most common syntactic tests used by linguists) have, to date, come nowhere near human performance (warstadt-etal-2019-neural).

Additionally, NLI models often over-attend to particular words to predict the correct answer (gururangan-etal-2018-annotation; clark-etal-2019-bert). wallace-etal-2019-universal show that some short sequences of non-human-readable text can fool many NLU models, including NLI models trained on SNLI, into predicting a specific label. In fact, ettinger-2020-whatbertisnot observed that for one of three test sets, BERT loses some accuracy on word-perturbed sentences, but that there exists a subset of examples for which BERT’s accuracy remains intact. This led ettinger-2020-whatbertisnot to speculate that “some of BERT’s success on these items may be attributable to simpler lexical or n-gram information”.

Thus, it is reasonable to wonder whether models can perform equally well on permuted sentences, since the model is able to view the same collection of words. If so, it suggests that these state-of-the-art models actually perform like bag-of-words models (blei-etal-2003-latent; mikolov2013efficient), i.e., they pay little attention to word order. This being said, there are empirical reasons (in addition to the theoretical ones described above) to expect that knowledge of word order is crucial for performing NLI. An early human annotation effort on top of the PASCAL RTE dataset (dagan-etal-2006) discovered that “syntactic information alone is sufficient to make a judgment” on one third of the textual entailment examples in RTE, whereas almost half could be solved if a thesaurus were additionally provided (vanderwende2005syntax).

3 Syntactic Permutation Acceptance

As we mentioned, linguists generally take syntactic structure to be necessary for humans to determine the meaning of sentences. Many find the NLI task to be a very promising approximation of human natural language understanding, in part because it is rooted in the tradition of logical entailment. In the spirit of propositional logic, sentence meaning is taken to be truth-conditional (frege1948sense; montague-1970-universal; chierchia-mcconnell-1990-meaning; heim-kratzer-1998-semantics). That is to say, understanding a sentence is equivalent to knowing the actual conditions of the world under which the sentence would be (judged) true (wittgenstein-1922-tractatus). If grammatical sentences are required for sentential inference, as per a truth-conditional approach (montague-1970-universal), then permuted sentences should be meaningless. Put another way, the meanings of highly permuted sentences (if they exist at all) are not propositions, and thus do not have truth conditions; and only from the truth conditions of sentences can we tell whether they entail each other. In short, we should not expect the task of textual entailment to be defined at all in our “unnatural” case.

For our purposes, we hypothesize that a syntax-aware model might respond to permuted examples in one of a few ways. First, if for every example in the dataset all of its permuted counterparts were hopelessly mixed up, the model might assign near-zero probability mass to the gold label (effectively recognizing the ungrammaticality). In this somewhat extreme case, no example would have any of its permuted counterparts accepted (i.e., assigned the original example’s gold label) by the model. Second, permuted examples might instead baffle the model into assigning equal probability mass to all three labels, effectively choosing a label at random (this is inconsistent with our findings). Third, the model might simply be unable to interpret permuted sentences at all. In that case, it might either always assign a particular label (e.g., neutral) or assign a non-entailment label (along the lines of 2-class textual entailment datasets like RTE; dagan-etal-2005-pascal); neither option is consistent with what we find. However, if the model does not care about word order and operates like a bag of words, it will accept permuted examples at a high rate.

We show that none of the investigated models care about word order, according to our suite of Permutation Acceptance metrics, which quantify how many permuted sentences are accepted, i.e., assigned the gold label, by a model. In light of the “sentence superiority” findings, we argue that if the syntax is corrupted such that the resulting sentences are ungrammatical or meaningless, humans (and human-like NLU models) should achieve very low Permutation Acceptance scores. We show that humans indeed struggle with unnatural language inference data in § 8.

4 Methods

Constructing the permuted dataset.

Based on our notion of Permutation Acceptance, we devise a series of experiments using trained models on various NLI datasets. Concretely, for a given dataset D having splits D_train and D_test, we first train an NLI model M on D_train that achieves performance comparable to that reported in the original papers. We then construct a randomized version of D_test, which we term D̂_test, such that for each example (p, h, y) (where p and h are the premise and hypothesis sentences of the example, respectively, and y is the gold label), we use a permutation operator F that returns a list of q permuted sentence pairs (p̂ and ĥ), where q is a hyperparameter.

F essentially permutes the positions of all words in a given sentence (i.e., in either the premise or the hypothesis) with the restriction that no word remains in its original position. In our initial setting, we do not explicitly control the placement of words relative to their original neighbors, but we analyze the effect of such word clumping in § 5.

Thus, D̂_test now consists of q × |D_test| examples, with q different permutations of the hypothesis and premise for each original test example pair. If a sentence contains n words, the number of available permutations grows factorially in n (under the no-fixed-point restriction it is the derangement number !n ≈ n!/e), and F outputs a sampled list of q of them.
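The operator F can be sketched as rejection sampling over index permutations: shuffle the indices until none remains at its original position (a derangement). The sketch below is our own minimal illustration, not the authors' released code; `make_permutations` and its per-permutation seeding scheme are assumptions.

```python
import random

def permute_no_fixed_points(tokens, rng):
    """Return a copy of `tokens` reordered so that no position keeps its
    original token index (an index derangement), via rejection sampling."""
    if len(tokens) < 2:
        raise ValueError("need at least two tokens to displace every word")
    idx = list(range(len(tokens)))
    while True:
        rng.shuffle(idx)
        # accept only if no index is a fixed point
        if all(i != j for i, j in enumerate(idx)):
            return [tokens[j] for j in idx]

def make_permutations(sentence, q):
    """Produce q independently seeded permutations of a whitespace-split
    sentence, mirroring the role of the operator F in the text."""
    toks = sentence.split()
    return [" ".join(permute_no_fixed_points(toks, random.Random(seed)))
            for seed in range(q)]
```

Rejection sampling accepts a shuffle with probability !n/n! ≈ 1/e, so on average only about three shuffles are needed regardless of sentence length.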

Defining Permutation Acceptance.

The choice of q naturally allows us to take a statistical view of the predictability of a model on the permuted sentences. To that end, we define the following notational conventions. Let A be the original accuracy of a given model M on a dataset D, computed as c / |D| (or the corresponding percentage), where c is the number of examples predicted correctly with respect to the ground truth.

Let ρ be the percentage of permutations of an example deemed correct (i.e., assigned the ground-truth label y) by the model M:

    ρ(p, h) = (100 / q) · Σ_{i=1..q} 1[ M(p̂_i, ĥ_i) = y ]
Let Ω_x be the percentage of examples for which ρ exceeds a threshold x. Concretely, a given premise-hypothesis pair counts as predicted correctly according to Ω_x if more than x percent of its permutations (p̂ and ĥ) are assigned the gold label by the model M. Mathematically,

    Ω_x = (100 / |D_test|) · Σ_{(p, h, y) ∈ D_test} 1[ ρ(p, h) > x ]
There are two specific cases of Ω_x that we are most interested in. First, we define Ω_max, or the Maximum Accuracy, where the threshold is set just below 100/q; Ω_max thus gives the percentage of examples for which at least one of the q permutations is assigned the gold label y by model M. Second, we define Ω_rand, or random-baseline accuracy, where x = 33.33%, the chance probability for balanced 3-way classification. Since exceeding chance entails at least one accepted permutation, Ω_rand is a more stringent criterion than Ω_max and lower-bounds it.

We also define D_f to be the list of examples originally marked incorrect according to A but now deemed correct according to Ω_max, and D_c to be the list of examples originally marked correct according to A. Thus, D_f ∩ D_c = ∅ necessarily. Additionally, we define P̄_c and P̄_f, each ranging from 0 to 100, as the dataset-average percentage of permuted examples deemed correct when the examples were originally correct (P̄_c) and when the examples were originally incorrect as per A (P̄_f; hence, “flipped”).
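Given a boolean matrix of per-permutation outcomes, these quantities (the at-least-one accuracy, the above-chance accuracy, and the averages over originally correct vs. originally incorrect examples) reduce to a few array reductions. The function and variable names below are ours, and the sketch assumes balanced 3-way classification:

```python
import numpy as np

def permutation_acceptance(perm_correct, orig_correct):
    """perm_correct: (N, q) bool array, True where the model assigns the
    gold label to permutation j of example i.
    orig_correct: (N,) bool array, True where the unpermuted example was
    predicted correctly.
    Returns (omega_max, omega_rand, p_bar_c, p_bar_f) as fractions."""
    frac = perm_correct.mean(axis=1)             # per-example acceptance rate
    omega_max = perm_correct.any(axis=1).mean()  # >= 1 permutation accepted
    omega_rand = (frac > 1 / 3).mean()           # accepted above chance (3-way)
    p_bar_c = frac[orig_correct].mean()          # originally correct examples
    p_bar_f = frac[~orig_correct].mean()         # originally incorrect ("flipped")
    return omega_max, omega_rand, p_bar_c, p_bar_f
```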


Since the permutation operator F produces ungrammatical, nonsensical sentences (Table 1), we have two hypotheses given our discussion so far: (a) permutations resulting from F will elicit random outcomes, with P̄_c and P̄_f hovering around the uniform chance probability of 33%; or (b) permutations from F will receive incorrect predictions, driving Ω_max, P̄_c and P̄_f toward zero. However, we observe neither (a) nor (b) with the state-of-the-art models.

5 Results

Model Eval Dataset A Ω_max P̄_c P̄_f Ω_rand
RoBERTa (large) MNLI_m_dev 0.906 0.987 0.707 0.383 0.794
MNLI_mm_dev 0.901 0.987 0.707 0.387 0.790
SNLI_dev 0.879 0.988 0.768 0.393 0.826
SNLI_test 0.883 0.988 0.760 0.407 0.828
A1_dev 0.456 0.897 0.392 0.286 0.364
A2_dev 0.271 0.889 0.465 0.292 0.359
A3_dev 0.268 0.902 0.480 0.308 0.397
Mean 0.652 0.948 0.611 0.351 0.623
Harmonic Mean 0.497 0.946 0.572 0.344 0.539
BART (large) MNLI_m_dev 0.902 0.989 0.689 0.393 0.784
MNLI_mm_dev 0.900 0.986 0.695 0.399 0.788
SNLI_dev 0.886 0.991 0.762 0.363 0.834
SNLI_test 0.888 0.990 0.762 0.370 0.836
A1_dev 0.455 0.894 0.379 0.295 0.374
A2_dev 0.316 0.887 0.428 0.303 0.397
A3_dev 0.327 0.931 0.428 0.333 0.424
Mean 0.668 0.953 0.592 0.351 0.634
Harmonic Mean 0.543 0.951 0.546 0.347 0.561
DistilBERT MNLI_m_dev 0.800 0.968 0.775 0.343 0.779
MNLI_mm_dev 0.811 0.968 0.775 0.346 0.786
SNLI_dev 0.732 0.956 0.767 0.307 0.731
SNLI_test 0.738 0.950 0.770 0.312 0.725
A1_dev 0.251 0.750 0.511 0.267 0.300
A2_dev 0.300 0.760 0.619 0.265 0.343
A3_dev 0.312 0.830 0.559 0.259 0.363
Mean 0.564 0.883 0.682 0.300 0.575
Harmonic Mean 0.445 0.873 0.664 0.296 0.490
Table 2: Statistics for Transformer-based models. All models are trained on the MNLI corpus (williams-etal-2018-broad). A is the accuracy on unpermuted examples. Ω_max, or Max Accuracy, counts a data point as correct if any of its q permutations yields the correct result. P̄_c is the mean percentage of permutations which were correct when the original prediction is correct. P̄_f is the mean percentage of permutations which are correct when the original prediction is incorrect (flip). Ω_rand is the percentage of data points for which the ground-truth label is chosen at a rate above the random uniform baseline (1/3). Bold marks the highest value per metric (red shows the model is insensitive to permutation).
Model Eval Dataset A Ω_max P̄_c P̄_f Ω_rand
InferSent MNLI_m_dev 0.658 0.904 0.842 0.359 0.712
MNLI_mm_dev 0.669 0.905 0.844 0.368 0.723
SNLI_dev 0.556 0.820 0.821 0.323 0.587
SNLI_test 0.560 0.826 0.824 0.321 0.600
A1_dev 0.316 0.669 0.425 0.395 0.313
A2_dev 0.310 0.662 0.689 0.249 0.330
A3_dev 0.300 0.677 0.675 0.236 0.332
Mean 0.481 0.780 0.731 0.322 0.514
Harmonic Mean 0.429 0.767 0.694 0.311 0.455
ConvNet MNLI_m_dev 0.631 0.926 0.773 0.340 0.684
MNLI_mm_dev 0.640 0.926 0.782 0.343 0.694
SNLI_dev 0.506 0.819 0.813 0.339 0.597
SNLI_test 0.501 0.821 0.809 0.341 0.596
A1_dev 0.271 0.708 0.648 0.218 0.316
A2_dev 0.307 0.725 0.703 0.224 0.356
A3_dev 0.306 0.798 0.688 0.234 0.388
Mean 0.452 0.817 0.745 0.291 0.519
Harmonic Mean 0.404 0.810 0.740 0.279 0.473
BiLSTM MNLI_m_dev 0.662 0.925 0.800 0.351 0.711
MNLI_mm_dev 0.681 0.924 0.809 0.344 0.724
SNLI_dev 0.547 0.860 0.762 0.351 0.598
SNLI_test 0.552 0.862 0.771 0.363 0.607
A1_dev 0.262 0.671 0.648 0.271 0.340
A2_dev 0.297 0.728 0.672 0.209 0.328
A3_dev 0.304 0.731 0.656 0.219 0.331
Mean 0.472 0.814 0.731 0.301 0.520
Harmonic Mean 0.410 0.803 0.725 0.287 0.463
Table 3: Statistics for non-Transformer models. All models are trained on the MNLI corpus (williams-etal-2018-broad). A is the accuracy on unpermuted examples. Ω_max, or Max Accuracy, counts a data point as correct if any of its q permutations yields the correct result. P̄_c is the mean percentage of permutations which were correct when the original prediction is correct. P̄_f is the mean percentage of permutations which are correct when the original prediction is incorrect (flip). Ω_rand is the percentage of data points for which the ground-truth label is chosen at a rate above the random uniform baseline (1/3). Bold marks the highest value per metric (red shows the model is insensitive to permutation).
Model A Ω_max P̄_c P̄_f Ω_rand
RoBERTa (large) 0.784 0.988 0.726 0.339 0.773
InferSent 0.573 0.931 0.771 0.265 0.615
ConvNet 0.407 0.752 0.808 0.199 0.426
BiLSTM 0.566 0.963 0.701 0.271 0.611
Table 4: Results of evaluation on the OCNLI dev set. All models are trained on the OCNLI corpus (hu-etal-2020-ocnli). Max accuracy (Ω_max) is computed based on whether any of the q permutations per data point yields the correct result. P̄_c is the mean percentage of permutations which were correct when the original prediction is correct. P̄_f is the mean percentage of permutations which are correct when the original prediction is incorrect (flip). Bold marks the highest value per metric (red shows the model is insensitive to permutation).

We present results for two types of models: (a) Transformer-based models and (b) non-Transformer models. Among Transformer-based models, we investigate state-of-the-art pre-trained models such as RoBERTa (large) (liu-et-al-2019-roberta) and BART (large) (lewis-etal-2020-bart), as well as the relatively small DistilBERT model (sanh2020distilbert). For (b), we consider several pre-Transformer-era recurrent and convolution-based neural networks: InferSent (conneau2017supervised), a bidirectional LSTM (collobert2008unified) and a ConvNet (zhao2015self). We train all models on MNLI (williams-etal-2018-broad), and evaluate on in-distribution (SNLI (bowman-etal-2015-large) and MNLI) and out-of-distribution datasets (ANLI; nie-etal-2020-adversarial). We independently verified the Transformer-based results on our trained models using HuggingFace Transformers (wolf2020transformers), as well as on pre-trained checkpoints from FairSeq (ott2019fairseq) using the PyTorch Model Hub. For non-Transformer models, we use the codebase from conneau2017supervised. We use q = 100, and use 100 seeds for the randomizations of each example in D_test to ensure full reproducibility. We drop examples from test sets for which we are unable to compute all unique randomizations; typically these are examples with sentences shorter than 6 tokens. Code, randomized data, and model checkpoints will be released publicly.
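The 6-token cutoff follows from counting fixed-point-free permutations (derangements), assuming 100 unique randomizations are required per sentence: a 5-token sentence admits only !5 = 44 derangements, while a 6-token sentence admits !6 = 265. A small helper (ours) using the standard recurrence:

```python
def num_derangements(n):
    """Count permutations of n items with no fixed point (!n), via the
    recurrence !n = (n - 1) * (!(n-1) + !(n-2)), with !0 = 1 and !1 = 0."""
    if n == 0:
        return 1
    a, b = 1, 0  # !0, !1
    for i in range(2, n + 1):
        a, b = b, (i - 1) * (a + b)
    return b
```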

Models accept many permuted examples.

We find that Ω_max for models trained on MNLI and evaluated on MNLI (in-domain generalization) is strikingly high: 98.7% on the MNLI dev sets. This shows that, for almost all examples in the test set, there exists at least one permutation for which the model predicts the correct answer. We also observe a significantly high Ω_rand at 79.4%, suggesting the models outdo even a random baseline in accepting permuted, ungrammatical sentences.

Furthermore, we observe a similar effect on Ω_max in out-of-domain generalization when evaluating on the ANLI dataset splits, where Ω_max is significantly higher than A. As a consequence, we encounter many flips, where the example was originally predicted incorrectly by the model but at least one permutation of that example elicits the correct response. Recall, however, that this analysis presupposes knowledge of the correct label, so the test can be thought of as running a syntax-based stress test on the model until we reach the correct answer (or give up after exhausting our q permutations).

In the case of out-of-domain generalization, Ω_rand drops considerably. The probability of a permuted sentence being predicted correctly is also significantly higher for examples which were themselves predicted correctly (P̄_c > P̄_f for all test splits). These two results suggest that the investigated NLU models act as bag-of-words models, for which it is harder to find a correct permutation of an already misclassified, non-generalizable sentence.

Models are very confident.

The phenomenon we observe would be of less concern if the correct label prediction were just an outcome of chance, which could occur when the entropy of the log probabilities of the model output is high (suggesting near-uniform probabilities over the entailment, neutral and contradiction labels). We first investigate the model probabilities for the Transformer-based models on the permutations that lead to the correct answer in Figure 1. We find overwhelming evidence that model confidences on in-distribution datasets (MNLI, SNLI) are highly skewed, resulting in low entropy, with the degree of skew varying among model types. BART proves to be the most skewed of the three model types we consider.

To investigate whether the skewedness is a function of model capacity, we inspect the log probabilities of a lower-capacity model, DistilBERT, on random permutations, and find them very similar to those of the RoBERTa (large) model, although DistilBERT does exhibit lower A, Ω_max and Ω_rand, suggesting that its limited capacity prevents it from understanding certain premise-hypothesis pairs at all.

For the non-Transformer models, whose accuracy is lower, we report the relative performance in terms of Ω_max (Table 3) and average entropy (Figure 1). As expected, since the non-Transformer models are significantly worse overall, the Ω_max they achieve is also lower. However, comparing the averaged entropy of the model predictions, it is clear that there is some benefit to being a worse model: these models are not as overconfident on randomized sentences as the Transformers are.
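The entropy statistic plotted in Figure 1 can be computed directly from the model's three-way logits; a uniform prediction attains the maximum of ln 3 ≈ 1.10 nats, while a confident prediction approaches 0. A minimal NumPy sketch (names ours):

```python
import numpy as np

def prediction_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over the
    entailment / neutral / contradiction logits, numerically stabilized."""
    z = logits - np.max(logits, axis=-1, keepdims=True)
    p = np.exp(z) / np.sum(np.exp(z), axis=-1, keepdims=True)
    return -np.sum(p * np.log(np.clip(p, 1e-12, None)), axis=-1)
```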

Figure 1: Average entropy of model confidences on permutations that yielded the correct results, presented as box plots, computed for Transformer-based models (top row) and non-Transformer models (bottom row). Results are shown separately for examples whose original prediction was correct (True) and incorrect (False). The boxes signify the quartiles of the entropy distributions.

Similar artifacts in Chinese NLU.

To verify the observation conclusively, we extended the experiments to the Original Chinese NLI dataset (hu-etal-2020-ocnli, OCNLI). We re-use the RoBERTa (large) and non-Transformer (InferSent, ConvNet, BiLSTM) architectures, trained on OCNLI. We make similar observations (Table 4), suggesting that the phenomenon is not just an artifact of English text, but extends to natural language understanding as a whole.

Other Results

In addition to the results presented here, we investigate the effect of sentence length (which correlates with the number of possible permutations) and the effect of hypothesis-only randomization. Results are presented in Appendix A and Appendix C.

6 Analyzing Syntactic Structure Associated with Tokens

Since all investigated models accept permutations at a relatively high rate, we are led to ask: what properties of particular permutations lead them to be accepted? We perform two initial analyses to shed light on this question. First, we ask to what extent preserving local word order despite permutation is correlated with higher Permutation Acceptance scores. We find that there is some correlation, but it does not fully explain the high Permutation Acceptance scores. Second, we ask whether acceptance is related to a more abstract measure of local word relations, i.e., part-of-speech (POS) neighborhood. We find that there is little effect of POS neighbors for non-Transformer models, but that RoBERTa, BART, and DistilBERT show a distinct effect. Taken together, these analyses suggest that some local word order information affects models’ Permutation Acceptance scores, and that methods which decrease model reliance on this information could be fruitful.

Preserving Local Word Order Leads to Higher Permutation Acceptance.

In our initial experiments, we randomized the word order of the sentences with the constraint that no word appears in its original position. This kind of randomization can still preserve the relative positions of some n-grams. To analyze this effect, we compute BLEU scores over 2-, 3- and 4-grams and compare the acceptability of the permuted sentences across different models. If preserved n-grams were driving the Permutation Acceptance effects, we would see a correlation between BLEU scores and acceptance. As a result of our permutation process, the maximum BLEU-3 and BLEU-4 scores are negligibly low, already calling into question the hypothesis that n-grams are the sole explanation for our finding. Because of this, we only compare BLEU-2 scores. (Detailed experiments on specially constructed permutations that cover the entire range of BLEU-3 and BLEU-4 are provided in Appendix D.) We find that the probability of a permuted sentence being predicted correctly by the model correlates with its BLEU-2 score (Figure 2). However, the base prediction rate of Transformer-based models is still far from random (66% for the lowest BLEU-2 range), and hence the language processing mechanisms employed by these models require further investigation.
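Because every permutation keeps the bag of words and the sentence length, unigram precision and the brevity penalty are both 1, so unsmoothed BLEU-2 against the original sentence reduces to the square root of the bigram precision. A self-contained sketch (ours; it omits the smoothing a full BLEU implementation might apply):

```python
from collections import Counter
from math import sqrt

def bleu2(reference, hypothesis):
    """Unsmoothed BLEU-2 of a permuted hypothesis against its original
    sentence: sqrt(unigram precision * bigram precision), where the
    unigram term is 1 for a pure reordering of the same tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    ref_bigrams = Counter(zip(ref, ref[1:]))
    hyp_bigrams = Counter(zip(hyp, hyp[1:]))
    overlap = sum((ref_bigrams & hyp_bigrams).values())  # clipped matches
    total = max(sum(hyp_bigrams.values()), 1)
    return sqrt(overlap / total)
```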

Figure 2: Relation of BLEU-2 score to the acceptability of permuted sentences across all test datasets for four models. We observe that the performance of RoBERTa and BART is surprisingly similar, and both can be set apart considerably from the non-Transformer-based models, such as InferSent and ConvNet.

Part-of-speech neighborhood tracks Permutation Acceptance.

Many syntactic formalisms, like Lexical Functional Grammar (kaplan-bresnan-1995-formal; bresnan-etal-2015-lexical, LFG), Head-driven Phrase Structure Grammar (pollard-sag-1994-head, HPSG) or Lexicalized Tree Adjoining Grammar (schabes-etal-1988-parsing; abeille-1990-lexical, LTAG), are “lexicalized”, i.e., individual words or morphemes bear syntactic features telling you which other things they can combine with. Taking LTAG as an example, two lexicalized trees would be associated with the verb “buy”: one which projects a phrase-structure minitree containing two noun phrases (for the subject and direct object, as in Kim bought a book), and one which projects a minitree containing three (one for the subject, another for the direct object, and a third for the indirect object, as in Kim bought Logan a book). In this way, an average tree family for any particular word provides information about the sort of syntactic contexts the word generally appears in, or roughly which syntactic neighbors the word has. Taking inspiration from lexicalized grammatical formalisms, we speculate that our NLI models might be performing well on permuted examples because they are reconstructing, perhaps noisily, the word order of a sentence from its words.

To test this, we operationalized the idea of a lexicalized minitree. First, we POS-tagged every example in the corpus using the 17 Universal Part-of-Speech tags (using spaCy; spacy2). For each word w, we compute the occurrence probability of POS tags on tokens in the neighborhood of w, for each sentence s in D_train containing w. The neighborhood is specified by a radius r (a symmetrical window of r tokens to the left and right of w). We denote this sentence-level probability of POS tags for a word w as π_s(w) (see Figure 6). These sentence-level POS-neighbor statistics can be averaged to get a corpus-level POS tag minitree probability π_train(w) (i.e., a type-level score). Then, for a sentence s and each word w in it, we compute a POS minitree overlap score as follows:

    O_w = | top-k(π_s(w)) ∩ top-k(π_train(w)) | / k

Concretely, O_w computes the overlap of the top-k POS tags in the neighborhood of a word with those of the train statistic. If a word has the same minitree in the permuted sentence as in the training set, the overlap is 1. For a given sentence s, the aggregate O_s is defined as the average of the overlap scores of the constituent words, O_s = (1/|s|) Σ_{w ∈ s} O_w, and we call it the POS minitree signature.

Now, we can compute the same score for a permuted sentence to have . The idea is if the permuted sentence POS signature comes close to the true sentence, then the ratio of will be close to 1. If the ratio is 1, that suggests the permuted sentence overlaps more than the original sentence with the train statistic. Put in other words, if high overlap correlates with percentage of permutations deemed correct (even in randomized sentences), then our models treat words as if they bear syntactic minitrees. Therefore, for where many of the permutations have high average POS minitree overlap score, we should expect a higher prediction accuracy.
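A concrete sketch of this computation (pure Python; the POS tagging itself is assumed to have happened upstream, and the helper names are ours, not the paper's):

```python
from collections import Counter, defaultdict

def neighbor_pos_counts(tagged_sentence, radius=2):
    """Count, for each word, the POS tags of tokens within `radius`
    positions to its left and right (the word's POS 'neighborhood')."""
    counts = defaultdict(Counter)
    n = len(tagged_sentence)
    for i, (word, _tag) in enumerate(tagged_sentence):
        for j in range(max(0, i - radius), min(n, i + radius + 1)):
            if j != i:
                counts[word][tagged_sentence[j][1]] += 1
    return counts

def topk_overlap(sent_counts, corpus_counts, k=2):
    """Fraction of top-k neighborhood POS tags shared between a word's
    sentence-level distribution and its corpus-level (train) statistic."""
    top_sent = {t for t, _ in sent_counts.most_common(k)}
    top_corpus = {t for t, _ in corpus_counts.most_common(k)}
    return len(top_sent & top_corpus) / k

def pos_minitree_signature(tagged_sentence, train_counts, radius=2, k=2):
    """Average top-k overlap over the words of a sentence."""
    sent_counts = neighbor_pos_counts(tagged_sentence, radius)
    scores = [topk_overlap(sent_counts[w], train_counts[w], k)
              for w, _tag in tagged_sentence if w in train_counts]
    return sum(scores) / len(scores) if scores else 0.0
```

Dividing the signature of a permuted sentence by that of the original then gives the ratio discussed above.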

Figure 3: Comparison of POS Tag Mini Tree overlap score with the percentage of permutations deemed correct by the models.

We investigate the relationship between the POS minitree overlap ratio and the percentage of permuted sentences accepted in Figure 3. We observe that the POS tag minitree hypothesis holds for the Transformer-based models RoBERTa, BART and DistilBERT, where the percentage of accepted pairs increases as the permuted sentences have higher overlap with the un-permuted sentence in terms of POS signature. For non-Transformer models such as InferSent, ConvNet and BiLSTM, the percentage of correct permutations remains flat or decreases as the ratio grows, suggesting that the reasoning process employed by these models does not preserve local abstract syntactic structure (i.e., POS neighbor relations).

7 Maximum Entropy Training

Here, we propose an initial attempt to mitigate the effect of correct prediction on permuted examples. We observed that the log probabilities of a model's output on permuted examples are significantly higher than random. A similar phenomenon has been observed before in Computer Vision (gandhi2019mutual), and suggests models struggle to learn mutual exclusivity: neural networks tend to output higher-than-random confidence even for unknown inputs, which might be an underlying cause of the high Permutation Acceptance phenomenon in our NLI models.

Figure 4: Effect of maximizing entropy training on RoBERTa (large)
Eval Dataset | Acc. (V) | Acc. (ME) | Perm. Acc. (V) | Perm. Acc. (ME)
mnli_m_dev | 0.905 | 0.908 | 0.984 | 0.328
mnli_mm_dev | 0.901 | 0.903 | 0.985 | 0.329
snli_test | 0.882 | 0.888 | 0.983 | 0.329
snli_dev | 0.879 | 0.887 | 0.984 | 0.333
anli_r1_dev | 0.456 | 0.470 | 0.890 | 0.333
anli_r2_dev | 0.271 | 0.258 | 0.880 | 0.333
anli_r3_dev | 0.268 | 0.243 | 0.892 | 0.334
Table 5: NLI accuracy (higher is better) and Permutation Acceptance (lower is better) of RoBERTa when trained on the MNLI dataset using the vanilla (V) and Maximum Entropy (ME) methods

Since our ideal model would be ambivalent about the randomized ungrammatical sentences, we devise a simple objective for training NLU models that bakes in the Mutual Exclusivity principle by maximizing entropy. Concretely, we train a RoBERTa model to do well on the MNLI dataset while maximizing the entropy (H) of its predictions on a subset of n randomized examples (p̂_i, ĥ_i) for each example (p, h). We modify the loss function as follows:

L = argmin_θ ∑_{((p,h),y)} −y log p(y | (p,h); θ) − ∑_{i=1}^{n} H(y | (p̂_i, ĥ_i); θ)

Using this simple maximum entropy method, we find that the model improves considerably with respect to its robustness to randomized sentences (Figure 4), all without taking a hit in accuracy (Table 5). We observe that none of the models reach a Permutation Acceptance score close to 0, suggesting further room to explore other methods for decreasing models' Permutation Acceptance.
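As a minimal sketch of this objective (pure Python with toy logits; real training would of course compute these terms inside an autodiff framework, and the function names are ours): the per-example loss is the cross-entropy on the original pair minus the prediction entropy on each of its permuted copies, so minimizing the loss maximizes uncertainty on permuted input.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def max_entropy_loss(orig_logits, gold_label, permuted_logits_list):
    """Cross-entropy on the original (premise, hypothesis) pair, minus
    the entropy of the predictions on each of its n permuted copies."""
    ce = -math.log(softmax(orig_logits)[gold_label])
    ent = sum(entropy(softmax(lg)) for lg in permuted_logits_list)
    return ce - ent
```

The loss is lowest when the model is confident on the original example and maximally uncertain (uniform over the three labels) on its permutations.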

8 Human Evaluation

Evaluator | Accuracy | Macro F1 | Acc. on shorter | Acc. on longer
X | 0.581 | 0.454 | 0.649 | 0.515
Y | 0.378 | 0.378 | 0.411 | 0.349
Table 6: Human evaluation on 200 permuted sentence pairs from the MNLI Matched development set (williams-etal-2018-broad) using two NLI experts. Half of the permuted pairs contained shorter sentences and the other half longer ones. Both experts were provided only the permuted sentences (not the original example or the label) and were disallowed from consulting with one another. All permuted sentences were predicted correctly by RoBERTa (large).

Since our models often accept permuted sentences, we ask how humans perform unnatural language inference on permuted sentences. We expect humans to struggle with the task, given our intuitions and the sentence superiority findings, but to test this, we presented two experts in Natural Language Inference (one a linguist) with a random sample of 200 permuted sentence pairs, and asked them to predict the entailment relation. The experts were provided with no information about the examples from which the permutations were drawn (above and beyond the common knowledge that NLI is usually defined as a roughly balanced 3-way classification task). Unbeknownst to the experts, all permuted sentences in the sample were in fact accepted by the RoBERTa model (trained on MNLI). We observe that the experts performed much worse than RoBERTa (Table 6), although their accuracy was somewhat higher than random. In a second sample, again unbeknownst to the experts, we provided permuted sentence pairs from the MNLI Matched development set, some of which were originally predicted correctly and some incorrectly by the RoBERTa (large) model. We find that for both experts, accuracy on permutations from originally correct examples was higher than on those from originally incorrect examples, corroborating earlier findings (dasgupta-etal-2018-evaluating; gururangan-etal-2018-annotation; naik-etal-2019-exploring) that word overlap is important.

9 Future Work & Conclusion

While we have shown that classification labels can be flipped based solely on a sentence reordering, future work could also explore the relationship between permutation and deletion. Although our results tentatively support the hypothesis that current models do not “know syntax” in a human-like way (according to our definition) and are mostly just sensitive to words (or perhaps n-grams), they are preliminary, and future work is required to fully understand human-like classification of permuted NLI examples.

In this work we show that state-of-the-art models do not rely on sentence structure the way we think they should. On the task of Natural Language Inference, we show that models (Transformer-based models, RNNs, and ConvNets alike) are largely insensitive to permutations of word order that corrupt the original syntax. This raises questions about the extent to which such systems understand “syntax”, and highlights the unnatural language understanding processes they employ.

A few years ago, manning-etal-2015-computational encouraged NLP to consider “the details of human language, how it is learned, processed, and how it changes, rather than just chasing state-of-the-art numbers on a benchmark task.” We expand upon this view, and suggest one particular future direction: we should train models not only to do well, but also not to overgeneralize to corrupted input.


Thanks to Shagun Sodhani, Hagen Blix, Ryan Cotterell, Emily Dinan, Nikita Nangia, Grusha Prasad, and Roy Schwartz for many invaluable comments and feedback on early drafts.


Appendix A Effect of Length on Permutation Acceptance

We investigate the effect of length on Permutation Acceptance in Figure 5. We observe that shorter sentences in general have a higher probability of acceptability for examples which were originally predicted correctly, since shorter sentences have fewer unique permutations. However, for examples which were originally predicted incorrectly, this trend is absent.
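One reason short sentences are easier to exhaust: the number of distinct orderings is a multinomial coefficient that grows factorially with length (and shrinks with repeated tokens). A quick sketch (the helper name is ours):

```python
from collections import Counter
from math import factorial

def unique_permutations(tokens):
    """Number of distinct orderings of a token sequence: n! divided by
    the factorial of each repeated token's count."""
    total = factorial(len(tokens))
    for count in Counter(tokens).values():
        total //= factorial(count)
    return total
```

A 6-token sentence with one repeated word has only 360 distinct orderings, while a 20-token sentence has on the order of 10^18, so sampled permutations cover short sentences far more densely.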

Figure 5: Effect of sentence length on Permutation Acceptance in Transformer-based models

Appendix B Example of POS Minitree

As we defined in § 6, we develop a POS signature for each word in a sentence in a test set, comparing it with the distribution of the same word in the training set. Figure 6 provides a snapshot of the word “river” from the test set, showing how the POS signature distribution of the word in a particular sentence matches the aggregated training statistic. In practice, we take the top-k tags for the word in the test signature as well as the train signature, and calculate the overlap of POS tags. When comparing model performance on permuted sentences, we compute a ratio between this overlap score and the overlap score of the permuted sentence. In Figure 6, “river” would have a POS tag minitree score of 0.75.

Figure 6: Example POS signature for the word ‘river’, calculated with a radius of 2. Probability of each neighbor POS tag is provided. Orange examples come from the permuted test set, and blue come from original train.
Figure 7: Effect of varying the acceptance threshold for all datasets, computing the percentage of examples that fall within the threshold. The top row consists of in-distribution datasets (MNLI, SNLI) and the bottom row contains out-of-distribution datasets (ANLI)

Appendix C Effect of Hypothesis only randomization

Figure 8: Comparing the effect of randomizing both premise and hypothesis versus only the hypothesis on two Transformer-based models, RoBERTa and BART (for more comparisons please refer to the Appendix). In Figure 8, we observe that the difference is marginal on in-distribution datasets (SNLI, MNLI), while hypothesis-only randomization fares worse on out-of-distribution datasets (ANLI). We also compare the mean number of permutations which elicited a correct response; naturally, hypothesis-only randomization causes a larger percentage of randomizations to be deemed correct.

In recent years, the impact of the hypothesis sentence (gururangan-etal-2018-annotation; tsuchiya-2018-performance; poliak-etal-2018-hypothesis) on NLI classification has been a topic of much interest. As we define in § 3, logical entailment can only be defined for pairs of propositions. We investigated this by randomizing only the hypothesis sentences while keeping the premise intact. We find (Figure 8) that the values are nearly identical for the two schemes, suggesting that even when only the hypothesis is randomized, the model exhibits a similar phenomenon.

Appendix D Effect of clumped words in random permutations

Figure 9: Relation of BLEU-2/3/4 scores against the acceptability of clumped-permuted sentences across all test datasets on all models.

Since our original permuted dataset consists of extremely randomized words, we observe very low BLEU-3 (< 0.2) and BLEU-4 (< 0.1) scores. To study the effect of overlap across a wider range of permutations, we devised an experiment where we clump certain words together before performing random permutations. Concretely, we clump 25%, 50% and 75% of the words in a sentence and then permute the remaining words and the clumped block as a single unit. This type of clumped permutation allows us to study the full range of BLEU-2/3/4 scores, which we present in Figure 9. As expected, the acceptability of permuted sentences increases linearly with BLEU score overlap.
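The exact grouping scheme is not spelled out here; a minimal sketch, assuming a single contiguous block at a random position is kept together as one unit (helper name and this assumption are ours), might look like:

```python
import random

def clumped_permutation(tokens, fraction, rng=None):
    """Keep a contiguous block covering `fraction` of the tokens as a
    single unit, then randomly reorder that unit with the remaining
    (singleton) tokens and flatten the result."""
    rng = rng or random.Random()
    n = len(tokens)
    size = max(1, round(fraction * n))
    start = rng.randrange(n - size + 1)
    clump = tokens[start:start + size]
    units = [clump] + [[t] for t in tokens[:start] + tokens[start + size:]]
    rng.shuffle(units)
    return [t for unit in units for t in unit]
```

Larger fractions preserve longer contiguous n-grams, which is what moves BLEU-2/3/4 toward the upper end of their range.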

Appendix E Effect of the acceptance threshold in various test splits

We defined two variations of the Permutation Acceptance metric, but theoretically it is possible to define any arbitrary threshold percentage to evaluate the unnatural language inference mechanisms of different models, including the extremes of the range. In Figure 7 we show the effect of different thresholds. We observe that for in-distribution datasets (top row, MNLI and SNLI splits), even in the most extreme setting (the maximum threshold), more than 10% of examples remain, and more than 25% in the case of InferSent and DistilBERT. For out-of-distribution datasets (bottom row, ANLI splits) we observe a much lower trend, suggesting generalization itself is the bottleneck in permuted sentence understanding.
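The thresholded metric can be sketched as follows (hypothetical helper; it assumes we already have, per example, the fraction of its sampled permutations the model classified correctly):

```python
def acceptance_at_threshold(accepted_fractions, x):
    """Share of examples for which at least a fraction x of their
    sampled permutations were classified correctly by the model."""
    hits = sum(1 for f in accepted_fractions if f >= x)
    return hits / len(accepted_fractions)
```

Sweeping x from 0 to 1 and plotting this share per dataset and model reproduces the kind of curves shown in Figure 7.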