
Understanding Prior Bias and Choice Paralysis in Transformer-based Language Representation Models through Four Experimental Probes

Recent work on transformer-based neural networks has led to impressive advances on multiple-choice natural language understanding (NLU) problems, such as Question Answering (QA) and abductive reasoning. Despite these advances, there is still limited work on understanding whether these models respond to perturbed multiple-choice instances in a sufficiently robust manner that would allow them to be trusted in real-world situations. We present four confusion probes, inspired by similar phenomena first identified in the behavioral science community, to test for problems such as prior bias and choice paralysis. Experimentally, we probe a widely used transformer-based multiple-choice NLU system using four established benchmark datasets. Here we show that the model exhibits significant prior bias and, to a lesser but still highly significant degree, choice paralysis, in addition to other problems. Our results suggest that stronger testing protocols and additional benchmarks may be necessary before the language models are used in front-facing systems or decision making with real-world consequences.


1 Background

Question Answering (QA) Hirschman and Gaizauskas (2001) and inference are important problems in natural language processing (NLP) and applied AI, including development of conversational ‘chatbot’ agents Siblini et al. (2019). Developments over the last five years in deep neural transformer-based models have led to significant improvements in QA performance, especially in the multiple-choice setting. Bidirectional Encoder Representations from Transformers (BERT) Devlin et al. (2018) is a neural transformer-based model that was pre-trained by Google and that consequently achieved state-of-the-art performance in a range of NLP tasks, including QA and Web search. BERT is designed to help computers understand the meaning of ambiguous language in text by using the surrounding text to establish context, and depends on capabilities such as bidirectional encoding, masked language modeling (MLM) and next sentence prediction.

BERT, and other models based on BERT, such as Patentbert Lee and Hsiang (2019), Docbert Adhikari et al. (2019), SciBERT Beltagy et al. (2019), DistilBERT Sanh et al. (2019) and K-bert Liu et al. (2020), have achieved groundbreaking results in diverse language understanding tasks, including QA Reddy et al. (2019); Fan et al. (2019); Lewis et al. (2019), text summarization Liu and Lapata (2019); Zhang et al. (2019a), sentence prediction Shin et al. (2019); Lan et al. (2019), dialogue response generation Zhang et al. (2019b); Wang et al. (2019), natural language inference McCoy et al. (2019); Richardson et al. (2020), and sentiment classification Gao et al. (2019); Thongtan and Phienthrakul (2019); Munikar et al. (2019). The model studied in this paper, RoBERTa Liu et al. (2019b), is a highly optimized version of the original BERT architecture that was first published in 2019 and improved over BERT on various benchmarks by margins ranging from 0.9 percent [on the Quora Question Pairs dataset Iyer et al. (2016)] to 16.2 percent [on the Recognizing Textual Entailment dataset Dagan et al. (2005); Haim et al. (2006); Giampiccolo et al. (2007); Bentivogli et al. (2009)].

Specifically, RoBERTa is trained with larger mini-batches and learning rates, removes the next-sentence pre-training objective, and focuses on improving the MLM objective to deliver better performance than BERT on problems such as Multi-Genre Natural Language Inference Williams et al. (2017) and Question-Based Natural Language Inference Rajpurkar et al. (2016). RoBERTa-based models have approached human performance on various (subsequently described) commonsense natural language understanding (NLU) benchmarks.

BERT’s original success on these NLU tasks has also motivated researchers to adapt it for multi-modal language representation Lu et al. (2019); Sun et al. (2019), cross-lingual language models Lample and Conneau (2019), and domain-specific language models, including in the medicine- Alsentzer et al. (2019); Wang et al. (2020) and biology-related domains Lee et al. (2020). Due to this widespread use, and the fact that even recent, more advanced models with billions of parameters are based on similar technology (deep transformers), it has become important to systematically study the linguistic properties of BERT using a battery of tests inspired by work first conducted in the behavioral sciences. In prior work, for example, several proposed approaches aimed to study the knowledge encoded within BERT, including fill-in-the-gap probes of MLM Rogers et al. (2020); Wu et al. (2019), analysis of self-attention weights Kobayashi et al. (2020); Ettinger (2020), the probing of classifiers with different BERT representations as inputs Liu et al. (2019a); Warstadt and Bowman (2020), and a ‘CheckList’ style approach to systematically evaluate the linguistic capability of a BERT-based model Ribeiro et al. (2020). Evidence from this line of research suggests that BERT encodes a hierarchy of linguistic information, with surface features at the bottom, syntactic features in the middle and semantic features at the top Jawahar et al. (2019). It ‘naturally’ learns syntactic information from pre-training text.

However, it has been found that while information can be recovered from its token representation Wu et al. (2020), it does not fully ‘understand’ naturalistic concepts like negation, and is insensitive to malformed input Rogers et al. (2020). The latter finding is related to the adversarial experiments (not dissimilar to those in the computer vision community) that researchers have conducted to test BERT’s robustness. Some of these experiments have shown that, even though BERT encodes information about entity types, relations, semantic roles, and proto-roles well, it struggles with the representations of numbers Wallace et al. (2019) and is also brittle to named entity replacements Balasubramanian et al. (2020). In addition, Shen and Kejriwal (2021) found clear evidence that fine-tuned BERT-based language representation models still do not generalize well, and may, in fact, be susceptible to dataset bias.

We describe a novel set of systematic confusion probes to test linguistically relevant properties of a standard, and currently widely-used, multiple-choice NLU system based on RoBERTa [1]. Our probes, described below, are not only replicable, but can be extended to other benchmarks and even newer language representation models as we show through preliminary additional experiments (described in Discussion) involving the recent T5-11B model. Unlike much of the prior work on this subject, we are not seeking to understand the layers of a specific network or how it encodes knowledge, but rather, to understand the commonsense properties of these models. A clear understanding of such properties allows us to test whether such language models, which are continuing to be rolled out in commercial products, are truly answering questions in a robust manner, or are disproportionately impacted by problems such as prior bias and choice paralysis. We subsequently define these notions more precisely, but intuitively, prior bias occurs when a language representation model has a consistent and statistically significant preference for selecting an incorrect candidate choice over another. Such a bias is usually undesirable, as it indicates the model may be amenable to being ‘tricked’ e.g., by introducing perturbations of the kind explored in this paper.

[1] Further detailed, with links to the publicly available code, in Methods.

Choice paralysis, on the other hand, occurs when the preference of the model for the correct candidate choice significantly and consistently diminishes as more (incorrect) choices are offered to a model in response to a prompt. Choice paralysis is inspired by a similarly named phenomenon in the behavioral and decision sciences [2], wherein it was found that giving people too many options can make it more difficult for them to choose between them. We experimentally test whether an analogous problem is observed in multiple-choice NLI QA systems Schwartz (2004); Kahneman (2011).

[2] Other common names include overchoice and choice overload. A related (although somewhat broader) problem in decision sciences is analysis paralysis Lenz and Lyles (1985).

Before proceeding with working definitions of choice paralysis and prior bias, we introduce some basic formalism for placing the remainder of the paper in context. First, let us define an instance as a pair composed of a prompt and a set of candidate choices. We clarify the reasons for this terminology in Methods, but intuitively, we use prompt (rather than question) because the input may not be a proper question [3]. An example of such a case is provided in the center of Fig 1.

[3] In general, commonsense benchmarks are NLU benchmarks, which may involve QA, but do not have to. In some cases, the task is abductive reasoning, while in others, the task is NLI or even goal satisfaction, as in the case of the subsequently described Physical IQA benchmark.

Figure 1: An example (from the real-world abductive NLI benchmark) of the four confusion probes used in this paper as perturbation-based interventions and detailed on the next page. Prompt-based interventions are shown at the top, and choice-based interventions are shown at the bottom. In aNLI, the prompt in an instance comprises two observations, and the candidate choices are the two hypotheses, of which exactly one is considered correct in the original unperturbed instance.

Given an instance, we assume that exactly one of its candidate choices is correct. Given a language representation model (such as one based on RoBERTa) that is designed to handle multi-choice NLI instances, we assume the output of the model on an instance to comprise the model’s predicted choice and a confidence set that includes the model’s confidence per candidate choice. We also consider the variance of this confidence set, calculated per instance. We say that an instance is perturbed either if its prompt is changed in some manner (including being assigned the ‘empty string’) or if its choice-set is modified through addition, deletion, substitution or any other modification of candidate choices. Finally, if a perturbation applied on an instance results in the perturbed instance not having any theoretically correct choice in response to the prompt [4], but the originally correct choice is still a candidate choice, we refer to that choice as the pseudo-correct choice. An obvious example of when this occurs is a perturbation that ‘deletes’ the prompt by assigning it the empty string. Since there is no prompt, none of the candidate choices are theoretically correct or incorrect. Assuming that the choice-set was not modified, the pseudo-correct choice would be the originally correct choice.

[4] We discuss four specific types of perturbations or ‘confusion probes’ used in our experimental study in the next section, but for present purposes, we note that these definitions and formalism apply regardless of the form of the perturbation itself.
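The model output described in this formalism can be illustrated with a minimal sketch. It assumes, purely for illustration, that per-choice confidences are obtained by applying a softmax to raw per-choice scores; the function names and the example scores are hypothetical and are not taken from the evaluated system.

```python
import math
import statistics


def softmax(scores):
    """Convert raw per-choice scores into confidences that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def model_output(scores):
    """Return (index of the predicted choice, per-choice confidences)."""
    confidences = softmax(scores)
    predicted = max(range(len(confidences)), key=confidences.__getitem__)
    return predicted, confidences


def confidence_variance(confidences):
    """Population variance of the confidences of a single instance."""
    return statistics.pvariance(confidences)


# Hypothetical raw scores for a two-choice instance (assumed values)
predicted, confidences = model_output([2.0, 0.5])
print(predicted, confidences, confidence_variance(confidences))
```

Under this sketch, a model that is indifferent between the candidate choices yields equal confidences and therefore a per-instance variance of zero.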

With these basic preliminaries in place, we define the prior bias of a multiple-choice NLI model with respect to a perturbed instance below:

Definition 1 (Prior Bias). Given an original instance (with correct choice ), a perturbation of that instance that does not have any correct choice, and a multiple-choice NLI model that respectively outputs and given and , we define as the prior bias of with respect to .

Note that Definition 1 above only applies to perturbed instances where there is no correct choice (although there may potentially be a pseudo-correct choice , depending on the specific type of perturbation applied). Definition 1 can also be generalized to quantify the prior bias of on an NLI benchmark (with respect to a specific perturbation) by aggregating

across all perturbed instances in the benchmark. Only if the null hypothesis that

cannot be rejected at a given level of confidence can we say with statistical certitude that does not have prior bias on that benchmark, with respect to the applied perturbation.

Next, using the same formalism, we can define choice paralysis in the context of multiple-choice NLI models:

Definition 2 (Choice Paralysis). Given an original instance (with a correct choice), a perturbation of that instance (with a larger choice-set that still contains the original correct choice, which remains the correct choice in response to the perturbed instance’s prompt), and a multiple-choice NLI model, the model is said to have choice paralysis with respect to the perturbed instance if its confidence in the correct choice given the perturbed instance is significantly lower than its confidence in the correct choice given the original instance. The magnitude of choice paralysis is given by the latter confidence minus the former.

Note that the direction of the subtraction matters i.e., a model can theoretically have negative choice paralysis whereby its confidence in the correct answer actually increases when a specific perturbation introduces a choice-set that is larger than the original choice-set. However, one important aspect that we note about Definition 2 is that the perturbed choice-set does not have to be a super-set of the original choice-set, although it is required to be larger and, at minimum, must contain the correct choice, as the original choice-set does. Furthermore, while there is no restriction on also perturbing the prompt, the perturbation must not be such that the theoretically correct answer changes. In practice, as we subsequently describe, our perturbation functions operate either at the level of the prompt, or choice-set, but not both.
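The sign convention of Definition 2 can be made concrete with a short sketch. The function names and confidence values below are hypothetical, and averaging over instances is only one simple way to aggregate, assumed here for illustration.

```python
def choice_paralysis_magnitude(conf_original, conf_perturbed):
    """Confidence in the correct choice before minus after the perturbation.

    Positive values mean the confidence dropped once more choices were
    offered; negative values mean it increased.
    """
    return conf_original - conf_perturbed


def mean_magnitude(pairs):
    """Average magnitude over (original, perturbed) confidence pairs."""
    magnitudes = [choice_paralysis_magnitude(o, p) for o, p in pairs]
    return sum(magnitudes) / len(magnitudes)


# Hypothetical confidences in the correct choice (assumed values)
pairs = [(0.92, 0.61), (0.80, 0.85), (0.97, 0.70)]
for original, perturbed in pairs:
    print(original, perturbed, choice_paralysis_magnitude(original, perturbed))
print("mean:", mean_magnitude(pairs))
```

The second pair in the example is a case of negative choice paralysis, since the confidence in the correct choice rises after the choice-set is expanded.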

Finally, we note that, although both Definitions 1 and 2 impose some constraints on the types of perturbations that can be applied as interventions on the original instances in a benchmark, they can work with any perturbation functions that adhere to these constraints. For instance, as noted earlier, Definition 1 can be used to measure prior bias as long as there is no theoretically correct choice in the perturbed instance. While it may be possible to modify the definition to also measure prior bias if this were not the case, we leave such an expanded definition and its empirical validation to future research. Conversely, Definition 2 assumes that, even after the perturbation, the instance continues to contain a single theoretically correct choice in its (expanded) candidate choice-set in response to the prompt.

Since the definitions do not dictate specific perturbation functions, in order to conduct experiments, we need to devise one or more perturbation functions that enable us to quantify these phenomena for real-world multi-choice NLI benchmarks, and a sufficiently powerful language representation model that can handle not only the original benchmarks, but also their perturbed versions. Next, we describe the four perturbation functions used in this paper for studying these phenomena.

1.1 Perturbation Methodology: Prompt-Based and Choice-Based Confusion Probes

We designed a set of four perturbation functions, also called confusion probes, that operate by systematically transforming multiple-choice NLI instances in four publicly available benchmarks, which have been widely used in the literature for assessing machine commonsense performance. These four benchmarks test the ability of a language representation model to select the best possible explanation for a given set of observations [aNLI Bhagavatula et al. (2020, 2019)], do grounded commonsense inference [HellaSwag Zellers et al. (2019b, a)], reason about both the prototypical use of objects and non-prototypical, but practically plausible, use of objects [PIQA Bisk et al. (2020a, b)], and answer social commonsense questions [SocialIQA Sap et al. (2019a, b)].

Each of the four confusion probes intervenes either at the level of the prompt, or the candidate choices, but not both. As noted earlier, an instance comprises both a prompt and a candidate choice-set. The prompt may or may not be an actual ‘question’, as understood grammatically. For example, as shown at the center of Fig 1, a single QA instance in the aNLI benchmark consists of two observations (the prompt) and two hypotheses (the choices or answers), of which one must be selected as being the best possible explanation for the given observations (abductive reasoning). In each of the four benchmarks, the structure of the instance is fixed, including the way in which the question is presented and the number of choices. The structure of instances in all four benchmarks, with an illustrative example each, is further detailed in Methods.

We designed two prompt-based interventions (No-Question probe and Wrong-Question probe), and two choice-based interventions (No-Right-Answer probe and Choice Paralysis probe), named intuitively and defined below. The No-Question intervention has been used in many other NLP tasks. For example, Kaushik and Lipton (2018) used this probe to test the difficulty of reading comprehension benchmarks. The examples in these benchmarks are tuples consisting of question, passage, and answer. In their experiments, they analyzed the model’s performance on various benchmarks when the test examples were provided with question-only or passage-only information (but not both). Similarly, in an NLI task, Gururangan et al. (2018); Poliak et al. (2018) re-evaluated high-performing models on hypothesis-only examples. In their experiments, models predicted the label (‘entailment’, ‘neutral’ or ‘contradict’) of a given hypothesis without seeing the premise. The No-Question probe in our experiment is similar to the hypothesis-only setting, in that it provides the model only with the candidate choices, without the prompt.

All our probes are visualized using an actual instance from a benchmark in Fig 1. Implementation details on how these benchmarks were manipulated to achieve these interventions are provided in Methods, and can be used for replication.

  1. No-Question probe: This intervention probes the model by removing the prompt altogether and only retaining the candidate choice-set [5].

     [5] In a slight abuse of terminology, and for reasons of maintaining compatibility with the QA literature where the use of the terms ‘prompt’ and ‘instance’ is non-standard, we use the (more specific) terms ‘question’ and ‘answer’ in the names of the probes, although in the main text we continue to use the (broader, but more accurate) terminology introduced in the early part of the paper.

  2. Wrong-Question probe: Similar to the No-Question probe, this probe retains the original candidate choice-set for an instance, but ‘swaps’ the original prompt in an instance with a prompt from another instance (in the same benchmark) that is not relevant to any of the answers.

  3. No-Right-Answer probe: This probe retains the prompt and all incorrect choices in an instance, but replaces the correct choice with a choice from another instance’s choice-set. The model is thus presented with a prompt but no correct choice in response to the prompt (among the presented choices). Note that, in the Wrong-Question probe, the candidate choices are completely unrelated to the corresponding prompt whereas in the No-Right-Answer probe, the non-substituted choices are often found to be semantically related to the prompt [6]. Theoretically, since all the choices are incorrect (in either probe), an ideal model would not exhibit higher prior bias for this probe, compared to the Wrong-Question probe, by getting confused by the misleading choice; however, we experimentally test whether this is indeed the case for the multiple-choice NLI system.

     [6] For example, in Fig 1, the original prompt contains two Kat-related observations. Following the Wrong-Question intervention, the instance is given two Tom-related observations, which are completely unrelated to the two original Kat-related hypotheses. In contrast, following the No-Right-Answer intervention, the observations remain the same, but the original (correct) Kat-related hypothesis is substituted by a Jim-related hypothesis. Semantically speaking, the substituted hypothesis is less related to the observations than the non-substituted (but still incorrect) one, which mentions Kat.

  4. Choice Paralysis probe: Different from the three probes above, this probe is designed to test choice paralysis, as defined earlier. This probe ‘expands’ the set of choices significantly using a systematic methodology that requires a parameter specifying the total number of choices to present to the model. This parameter must be higher than the number of choices originally presented in an instance per prompt. We rely on one of two sampling mechanisms to decide how the ‘incorrect’ choices (the choices other than the correct choice) are selected from the other instances in the benchmark. The first of these is simple random sampling, while the second is ‘heuristic’ sampling. In each of these, the first common step is to eliminate the original incorrect choices, followed by sampling the required number of incorrect choices (one fewer than the specified total) from the set of all correct choices from the other instances. When using simple random sampling, we sample these choices randomly, and with uniform probability, from that set. When using heuristic sampling, we select the choices that have greatest cosine similarity to the prompt in the RoBERTa sentence-embedding space. The latter, while not explicitly designed to be adversarial, is expected to lead to more confusion for the model as it selects answer choices that are somewhat related to the prompt in the embedding space, but that are still incorrect.
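The four probes listed above can be sketched as simple transformations of an instance. This is an illustrative sketch rather than the implementation used in the experiments: the `Instance` class, the function names, the toy instances, and the `embed` callable (standing in for a RoBERTa sentence encoder) are all hypothetical, and shuffling the position of the correct choice after expansion is an assumption.

```python
import math
import random
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Instance:
    prompt: str
    choices: Tuple[str, ...]  # candidate choice-set
    correct: int              # index of the originally correct choice


def _donor(inst, pool, rng):
    """Pick a different instance from the same benchmark at random."""
    return rng.choice([other for other in pool if other is not inst])


def no_question(inst):
    """No-Question: remove the prompt, keep the choice-set."""
    return replace(inst, prompt="")


def wrong_question(inst, pool, rng):
    """Wrong-Question: swap in the prompt of another instance."""
    return replace(inst, prompt=_donor(inst, pool, rng).prompt)


def no_right_answer(inst, pool, rng):
    """No-Right-Answer: replace the correct choice with another instance's."""
    donor = _donor(inst, pool, rng)
    choices = list(inst.choices)
    choices[inst.correct] = donor.choices[donor.correct]
    return replace(inst, choices=tuple(choices))


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def choice_paralysis(inst, pool, total, rng, embed=None):
    """Expand the choice-set to `total` choices: one correct, the rest sampled."""
    if total <= len(inst.choices):
        raise ValueError("total must exceed the original number of choices")
    answer = inst.choices[inst.correct]
    # Original incorrect choices are dropped; distractors are drawn from the
    # correct choices of the other instances.
    candidates = [o.choices[o.correct] for o in pool if o is not inst]
    candidates = [c for c in dict.fromkeys(candidates) if c != answer]
    if len(candidates) < total - 1:
        raise ValueError("not enough other instances to sample from")
    if embed is None:  # simple random sampling, uniform probability
        distractors = rng.sample(candidates, total - 1)
    else:              # heuristic sampling by cosine similarity to the prompt
        target = embed(inst.prompt)
        ranked = sorted(candidates, key=lambda c: cosine(embed(c), target), reverse=True)
        distractors = ranked[: total - 1]
    choices = distractors + [answer]
    rng.shuffle(choices)  # assumption: the correct choice's position is randomized
    return Instance(inst.prompt, tuple(choices), choices.index(answer))


# Toy, made-up instances for demonstration only
pool = [
    Instance("Kat opened the oven.", ("The cake was burnt.", "Kat went sailing."), 0),
    Instance("Tom missed the bus.", ("Tom was late for work.", "Tom baked bread."), 0),
    Instance("Jim watered the garden.", ("The plants grew.", "Jim lost his keys."), 0),
    Instance("Ann studied all night.", ("Ann passed the exam.", "Ann fed the cat."), 0),
]
expanded = choice_paralysis(pool[0], pool, total=3, rng=random.Random(0))
print(expanded.choices, expanded.correct)
```

Passing an `embed` function switches the expansion from simple random sampling to heuristic sampling; any sentence encoder that returns a fixed-length vector could be plugged in under this sketch.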

We note that a unique aspect of all of our probes (with the exception of choice paralysis) is that the performance, as measured using pseudo-accuracy (defined simply as the fraction of test-set instances where the model picked the pseudo-correct choice), should ideally decline post-intervention. This ideal performance (which is the same as random performance) occurs if the model has no prior bias. This makes the first three confusion probes different from other such experiments evaluating language models. For example, we would ideally want a model to pick among choices randomly when it is not presented with a prompt (No-Question probe). However, our probes and methodology are able to quantify the magnitude (and statistical significance) of this bias. Finally, for all probes and benchmarks, we conduct, and interpret the results of, each experiment not only by comparing the model’s post-intervention pseudo-accuracy performance to its original performance (on the unperturbed instances), but also by comparing its pseudo-accuracy to the expected performance that would be achieved by a system without any prior bias.
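Pseudo-accuracy and its comparison against random performance can be sketched as follows. The one-sided exact binomial test used here is an assumption chosen for illustration, not necessarily the significance test applied in the experiments, and the function names and example predictions are hypothetical.

```python
import math


def pseudo_accuracy(predicted, pseudo_correct):
    """Fraction of instances where the model picked the pseudo-correct choice."""
    hits = sum(p == c for p, c in zip(predicted, pseudo_correct))
    return hits / len(predicted)


def chance_accuracy(num_choices):
    """Expected pseudo-accuracy of a model that picks uniformly at random."""
    return 1.0 / num_choices


def binomial_tail(hits, n, p):
    """P(X >= hits) for X ~ Binomial(n, p), computed in log-space.

    Log-space terms avoid overflow for benchmark-sized n.
    """
    def log_pmf(k):
        return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                + k * math.log(p) + (n - k) * math.log(1 - p))
    return min(1.0, sum(math.exp(log_pmf(k)) for k in range(hits, n + 1)))


# Hypothetical predictions on six perturbed two-choice instances
predicted = [0, 0, 1, 0, 0, 0]
pseudo_correct = [0, 0, 0, 0, 1, 0]
hits = sum(p == c for p, c in zip(predicted, pseudo_correct))
acc = pseudo_accuracy(predicted, pseudo_correct)
p_value = binomial_tail(hits, len(predicted), chance_accuracy(2))
print(acc, p_value)
```

A small tail probability indicates that the observed pseudo-accuracy is unlikely under uniformly random selection, which is the behavior expected of a model without prior bias.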

2 Methods

2.1 Evaluation Datasets

The four benchmarks used in the experimental study are described below, with references for further reading. We also provide a representative example in Fig 2. We emphasize that an instance is the combination of the prompt and the candidate choice-set, and only if the prompt is an actual question, should the instance technically be thought of as a QA instance. In the general case, each instance should be thought of broadly as testing natural language understanding. In the rest of the discussion, we continue using the proper terms ‘instance’, ‘choice’ and ‘prompt’ (rather than the somewhat inaccurate terms ‘QA instance’, ‘question’ and ‘answer’, respectively) to refer to these concepts.

  1. aNLI (Abductive Natural Language Inference): Abductive Natural Language Inference (aNLI) Bhagavatula et al. (2020, 2019) is a commonsense benchmark dataset designed to test an AI system’s capability to apply everyday abductive reasoning to deduce possible explanations for a given set of observations. Formulated as a binary-classification task, the goal is to pick the most plausible explanatory hypothesis given two observations (from narrative contexts). The combination of the two observations (provided as input in a given instance) is considered the prompt. The benchmark contains 169,654 instances in the training set and 1,532 instances in the development (dev.) set, which is used as ‘test set’ in experiments. Compared to human performance of 0.93, the highest performance of a language model (at the time of writing) is 0.90, which is achieved by DeBERTa (Decoding-enhanced BERT with disentangled attention), produced by Microsoft Dynamics 365 AI.

  2. HellaSwag: HellaSwag Zellers et al. (2019b, a) is a dataset for studying grounded commonsense inference. It consists of 49,947 multiple-choice instances about ‘grounded situations’ (with 39,905 instances in the training set and 10,042 instances in dev. set). Each prompt comes from one of two domains (ActivityNet or wikiHow), with four candidate choices about what might happen next in the scene. The correct choice is the (real) sentence for the next event; the three incorrect choices are adversarially generated and human-verified, ensuring a non-trivial probability of ‘fooling’ machines but not (most) humans. Each HellaSwag instance provides two contexts as the prompt. UNICORN, a model based on T5, achieves the current highest performance (0.94) of models on this benchmark, which approaches human performance (0.96).

  3. PIQA: Physical Interaction QA (PIQA) Bisk et al. (2020a, b) is a novel commonsense QA benchmark for naïve physics reasoning, primarily concerned with testing machines on how humans interact with everyday objects in common situations. It tests, for example, what actions a physical object ‘affords’ (e.g., it is possible to use a cup as a doorstop), and also what physical interactions a group of objects afford (e.g., it is possible to place a computer on top of a table, but not the other way around). The dataset requires reasoning about both the prototypical use of objects (e.g., glasses are used for drinking) and non-prototypical (but practically plausible) uses of objects. There are 16,113 instances in PIQA’s training set and 1,838 instances in its dev. set. The goal in every PIQA instance is the prompt in our experiments. Compared to human accuracy (0.95), the machine’s best performance is 0.90, which is also achieved by UNICORN.

  4. Social IQA: Social Interaction QA Sap et al. (2019a, b) is a QA benchmark for testing social common sense. In contrast with prior benchmarks focusing primarily on physical or taxonomic knowledge, Social IQA is mainly concerned with testing a machine’s reasoning capabilities about people’s actions and their social implications. Actions in Social IQA span many social situations, and candidate choices comprise both human-curated answers and ‘adversarially-filtered’ machine-generated choices. Social IQA contains 33,410 instances in its training set, and 1,954 instances in its dev. set. While Social IQA separates the context from the question, the two collectively constitute the prompt, in keeping with the terminology mentioned earlier. We note that both human- and machine-performance on Social IQA are slightly lower than on the other benchmarks. Specifically, human accuracy on Social IQA is 0.88, with UNICORN achieving an accuracy of 0.83.

Figure 2: Instances (prompt and candidate choice-set) from the four commonsense benchmark datasets used for the experimental study herein. The prompt is highlighted in yellow, and the correct choice is in blue.

2.2 RoBERTa-based Model

Transformer-based models have rapidly emerged as state-of-the-art in the natural language processing community, both for specific tasks like question answering, but also for deriving ‘contextual embeddings’. BERT is a bi-directional transformer that can be pre-trained over a lot of unlabeled textual data to learn a language representation model. This model can, in a second step that does not typically use as much data as the first step, be fine-tuned for specific machine learning tasks.

RoBERTa is a more optimized re-training of BERT that removes the Next Sentence Prediction task from BERT’s pre-training, while introducing dynamic masking so that the masked token changes during the training epochs. Larger batch-training sizes were also found to be more useful in the training procedure. Unlike a more recent model like GPT-3, a pre-trained version of RoBERTa is fully available for researchers to use and can be fine-tuned for specific tasks Liu et al. (2019b).

Unsurprisingly, many of the systems occupying a significant fraction of top leaderboard positions for the commonsense reasoning benchmarks described earlier are based on RoBERTa (or some other optimized BERT-based model) in some significant manner. All experiments in this paper use a publicly available RoBERTa Ensemble model that was not developed by any of the authors, either in principle or practice, and that can be downloaded and replicated very precisely. It is also worth noting that even recent models that have superseded RoBERTa (such as T5) on the benchmarks are based on transformers as well. In the Discussion section, we show that (at least one of) the newer models may also exhibit biases when subject to the full set of confusion probes.

The RoBERTa Ensemble model is fine-tuned on each benchmark’s respective training set and evaluated on its dev. set to test the model’s performance. Each such trained model was verified to achieve over 80% performance (on average) over the four benchmarks, and performance is not substantively different from the state-of-the-art performance recorded on the current leaderboards. This fine-tuned model is re-used in the experiments below to evaluate post-intervention instances after applying the different (previously described) probes. We do not re-fine-tune the model post-intervention. We also note that the model continues to work (i.e., successfully makes a prediction) even for the Choice Paralysis confusion probe, where there is a mismatch between the training and evaluation set formats, due to instances in the evaluation set having more candidate choices compared to the training set. The only syntactic modification that was needed was to input to the model the number of choices in the candidate choice-set (presented to it per instance in the test phase), from which it needed to predict the correct choice. The model itself did not have to be re-fine-tuned.

2.3 Instance Perturbations Using Confusion Probes

Earlier, we described the four confusion probes (No-Question, Wrong-Question, No-Right-Answer, and Choice Paralysis) that we rely on for our experimental study. Below, we provide additional details relevant to their setup and implementation for the RoBERTa Ensemble model:

  1. No-Question: For each instance, we remove the prompt and make the RoBERTa-Ensemble model select a choice from its original choice-set without any contexts or prompts. Note that the RoBERTa-Ensemble model is capable of accepting an empty string as prompt.

  2. Wrong-Question: For each instance, we replace the original prompt with a mismatched prompt (called the ‘pseudo-prompt’). The pseudo-prompt for a given instance is an actual prompt from another randomly selected instance in that benchmark. No change is made to the choice-set. While there is a very small probability that the pseudo-prompt may be correctly answered by a choice from the unmodified choice-set, in practice, we could find no such cases when we randomly sampled and manually checked (post-intervention) 25 instances from each benchmark. As a further robustness check, we conducted another experiment where, instead of randomly sampling an instance from which the pseudo-prompt was selected, we only considered sampling from instances where the prompt did not share any words with the original prompt. The experimental results were not found to change appreciably compared to the simpler replacement protocol described above. Hence, we only report results for that simpler protocol. Furthermore, to account for randomness, we repeat the experiment five times (per benchmark), and report average performance with standard errors. In the ideal situation (where the model would not get confused and exhibit prior bias), the model should still randomly choose an option from the choice-set, similar to its ideal response on the No-Question probe.

  3. No-Right-Answer: For this probe, the choice-set is modified rather than the prompt. Namely, each instance’s correct choice is substituted with a mismatched choice (called the ‘pseudo-choice’). Similar to the Wrong-Question replacement protocol, we randomly sample an instance and substitute the ‘correct’ choice of the given instance with the correct choice of the randomly sampled instance. We conduct similar robustness checks (and also manual checks) as with the Wrong-Question protocol, but again found the results to be similar to that of the described protocol. Hence, only those results are reported. We also account for randomness through five experimental trials. Without a correct answer for the prompt, the ideal model would be expected to randomly pick a choice from all the presented (and all incorrect) choices.

  4. Choice Paralysis: The prompts of instances are kept unchanged, but the choices are expanded significantly. There are two ways to extend the choice-set. Assuming that the number of options for each instance is a parameter $n$ (with $n \in \{5, 10, 15\}$ in the experiments reported herein), for each target instance, we randomly pick $n - 1$ other instances at first. The correct answers of these instances (one per instance), together with the correct answer of the target instance, constitute the choices. Instead of randomly picking the $n - 1$ instances, we also experimented with a heuristic approach that picks out the $n - 1$ instances that have the prompts most similar to the prompt of the target instance. The similarity between two prompts is computed by calculating the cosine similarity between the sentence embeddings of the prompts. We randomly pick out 50 modified instances from the original dev. set as a new dev. set for evaluation and repeat this evaluation five times to account for randomness. A smaller dev. set is considered for this experiment due to the significant increase in computational time incurred per prompt (for the language representation model) when the choice-set is expanded. We report both the average and standard errors. Although the choices have been extended, the original prompt and the correct answer (for the original prompt) are retained; hence, an ideal model that does not suffer from choice paralysis is expected to choose the correct answer.
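
The four perturbations above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the instance layout (a dict with `prompt`, `choices` and `label` keys), the function names, and the random placement of the true answer in the expanded choice-set are all assumptions made for the example. Only the random variant of Choice Paralysis is shown; the heuristic variant would replace the random sample with the instances whose prompt embeddings have the highest cosine similarity to the target prompt.

```python
import random

def no_question(instance):
    """No-Question: blank the prompt, keep the choice-set unchanged."""
    return {**instance, "prompt": ""}

def wrong_question(instance, pool, rng):
    """Wrong-Question: use the prompt of another, randomly chosen instance."""
    other = rng.choice([x for x in pool if x is not instance])
    return {**instance, "prompt": other["prompt"]}

def no_right_answer(instance, pool, rng):
    """No-Right-Answer: swap the correct choice for another instance's correct choice.
    The label now marks the substituted (still incorrect) option."""
    other = rng.choice([x for x in pool if x is not instance])
    choices = list(instance["choices"])
    choices[instance["label"]] = other["choices"][other["label"]]
    return {**instance, "choices": choices}

def choice_paralysis(instance, pool, n, rng):
    """Choice Paralysis (random variant): the true answer plus the correct
    answers of n - 1 other instances. Placing the true answer at a random
    position is an assumption of this sketch."""
    others = rng.sample([x for x in pool if x is not instance], n - 1)
    choices = [x["choices"][x["label"]] for x in others]
    label = rng.randrange(n)
    choices.insert(label, instance["choices"][instance["label"]])
    return {**instance, "choices": choices, "label": label}

# Hypothetical toy pool of two-choice instances.
pool = [
    {"prompt": f"prompt {i}", "choices": [f"right {i}", f"wrong {i}"], "label": 0}
    for i in range(20)
]
rng = random.Random(0)
expanded = choice_paralysis(pool[3], pool, n=5, rng=rng)
```

Each function returns a new dict and leaves the original instance untouched, so the same dev. set can be re-perturbed across repeated trials with different seeds.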

3 Results

3.1 Question-Based Interventions

3.1.1 No-Question Probe

Figure 3: A boxplot summarizing the confidences of correct (and ‘pseudo-correct’) options before (denoted using _Ori) and after the No-Question intervention. The orange line and green triangle respectively represent the median and mean. We use ’+’ markers in the plot to indicate outliers.

We report the differences between confidence distributions of correct and ‘pseudo-correct’ options before and after the intervention (namely, removing the prompt) respectively in Fig 3. Along with the actual distribution, we also report the median and mean of the confidences of the correct and ‘pseudo-correct’ options for all four benchmarks. Ideally, we expect the post-intervention confidence of the pseudo-correct option to equal the reciprocal of the (benchmark-specific) number of choices per instance. We refer to this as the bias-free confidence. In other words, a system that behaves ideally would sample a choice from the uniform probability distribution over all the available choices when the prompt is not available. The reason is that, after removing the prompt, there is no correct answer anymore. While some prior bias is expected, as discussed briefly in Background, the extent of the bias is an open question that can only be observed empirically.

The figure shows that, while both the mean and median confidence of the pseudo-correct option decrease in each of the four benchmarks compared to the pre-intervention confidence of that (then correct) option, the extent of the decline depends on the benchmark and never equals the bias-free confidence, which equals 0.5 for PIQA and aNLI, 0.25 for HellaSwag and 0.33 for SocialIQA (reported also in Table 1 in the next experiment). We find that the pre-intervention confidence distribution obtained by RoBERTa on PIQA in particular is very similar to the post-intervention distribution. On SocialIQA, in contrast, the difference is more marked, and the post-intervention mean confidence almost achieves the bias-free value of 0.33. Therefore, RoBERTa is more likely to randomly choose an option for the SocialIQA benchmark after the prompt is removed, while there is significant prior bias in PIQA.

The dispersion of post-intervention confidence also differs from expectations that would hold in a bias-free setting. On SocialIQA and PIQA, RoBERTa obtains a slightly smaller inter-quartile range of post-intervention confidences; however, considerably larger inter-quartile ranges are observed on HellaSwag and aNLI, compared to the pre-intervention dispersion. The confidences of ‘pseudo-correct’ options in HellaSwag and aNLI are therefore more variable than in the other benchmarks. Thus, for some instances, RoBERTa needs the prompt to guide it toward the correct answers, but for others, it can directly make decisions by only looking at the options and without requiring any prompt. Put together, the results suggest fairly strong prevalence of prior bias, with SocialIQA serving as the lone exception (and with considerable dispersion of its own).
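
As a rough illustration of the quantities discussed for the No-Question probe (mean, median, inter-quartile range, and distance from the bias-free confidence), the following sketch summarizes a list of pseudo-correct confidences. The function names and the confidence values are made up for the example and are not taken from the experiments.

```python
import numpy as np

def bias_free_confidence(num_choices):
    """Reciprocal of the number of choices per instance."""
    return 1.0 / num_choices

def summarize(confidences, num_choices):
    """Mean, median, inter-quartile range, and gap between the mean
    confidence and the bias-free confidence."""
    c = np.asarray(confidences, dtype=float)
    q1, q3 = np.percentile(c, [25, 75])
    return {
        "mean": float(c.mean()),
        "median": float(np.median(c)),
        "iqr": float(q3 - q1),
        "gap_to_bias_free": float(c.mean() - bias_free_confidence(num_choices)),
    }

# Hypothetical post-intervention confidences on a two-choice benchmark.
post = [0.9, 0.8, 0.6, 0.7]
stats = summarize(post, num_choices=2)
```

A gap close to zero together with a small inter-quartile range would correspond to the ideal, bias-free behavior; a large positive gap indicates that the model still favors the pseudo-correct option without seeing any prompt.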

3.1.2 Wrong-Question Probe

Key findings for the Wrong-Question probe can be found in Table 1. For reference, we show in Column 2 the ‘bias-free’ pseudo-accuracy (i.e., the pseudo-accuracy without any prior bias, as discussed toward the end of Background), which, again, is equal to the reciprocal of the (benchmark-specific) number of choices per instance. This concept is defined similarly as in the previous experiment. Note that this is technically also equal to the bias-free confidence in the pseudo-correct choice. In contrast, in Column 4, we report the actual (average) pseudo-accuracy achieved on each of the four benchmarks using the RoBERTa model, following the Wrong-Question intervention.

Dataset | Bias-free pseudo-accuracy / confidence | Average confidence difference | Actual pseudo-accuracy (+/- std. err.)
PIQA | 0.5 | 0.015 | 0.72 (+/- 0.0014)
aNLI | 0.5 | 0.044 | 0.59 (+/- 0.0024)
SocialIQA | 0.33 | 0.022 | 0.40 (+/- 0.0009)
HellaSwag | 0.25 | 0.060 | 0.61 (+/- 0.0081)

Table 1: A summary of results following the Wrong-Question intervention.

Consistent with what we found in Fig 3, RoBERTa achieves a high pseudo-accuracy post-intervention on PIQA, and closer-to-ideal pseudo-accuracy on SocialIQA. Therefore, on PIQA the model is clearly susceptible to this kind of confusion, possibly because of a strong prior bias. While not as extreme as SocialIQA, the model does show a marked decrease in pseudo-accuracy on the HellaSwag and aNLI benchmarks, compared to the original pre-intervention accuracy. Standard errors on HellaSwag and aNLI are also higher than on PIQA and SocialIQA, which is a direct consequence of the confidences of pseudo-correct options on these two benchmarks being more variable (than PIQA and SocialIQA). In all cases, results are significantly different, at the 99 percent confidence level or higher, compared to both the ideal pseudo-accuracy, as well as the pre-intervention accuracy. The latter is encouraging, but expected, and supports the intuition that the prompt matters for the model but the former suggests that it matters less than it should.
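
To give a feel for the size of the gap between actual and bias-free pseudo-accuracy, the sketch below divides that gap by the reported standard error, using the values from Table 1. The text does not state which significance test was used, so this normal-approximation z-score is an assumption made for illustration; with only five trials per benchmark, a t-based test would be stricter.

```python
# (bias-free pseudo-accuracy, actual pseudo-accuracy, standard error), from Table 1
table1 = {
    "PIQA": (0.5, 0.72, 0.0014),
    "aNLI": (0.5, 0.59, 0.0024),
    "SocialIQA": (0.33, 0.40, 0.0009),
    "HellaSwag": (0.25, 0.61, 0.0081),
}

Z_99 = 2.576  # two-sided 99% critical value of the standard normal

def z_score(bias_free, actual, std_err):
    """Distance of the actual pseudo-accuracy from the bias-free value,
    measured in standard errors."""
    return (actual - bias_free) / std_err

z = {name: z_score(*row) for name, row in table1.items()}
significant = {name: abs(value) > Z_99 for name, value in z.items()}
```

Under this approximation every benchmark lies far beyond the 99 percent threshold, which agrees with the statement that even the best case (SocialIQA) remains significantly different from the ideal.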

In addition, the average confidence difference between the No-Question probe and the Wrong-Question probe in each benchmark is not only positive (Column 3 in Table 1), but we have verified all of them to be significantly greater than 0 with at least 99 percent confidence. The result implies that the appearance of the wrong prompt makes the model start to ‘doubt’ the pseudo-correct option compared to the No-Question probe, and to reconsider which option is more correct for the current mismatched prompt (following the Wrong-Question intervention). On average, we have also verified the confidence of the model in the pseudo-correct option to be significantly different from the pre-intervention confidence in that (then correct) option. We also confirmed that this average pseudo-correct confidence is significantly different from the ideal confidence noted in Column 2.

It bears noting that even in the best case (SocialIQA), although RoBERTa’s pseudo-accuracy is significantly closer to the ideal compared to the other benchmarks, it is still significantly different. Therefore, the assumption that the model needs to be presented with an actually correct option is a powerful one that should not be underestimated in practice. When the options are all incorrect, the model does not just randomly or uniformly pick one out as its choice. There is clear evidence of a prior bias, a finding also supported earlier by the No-Question probe.

3.2 Choice-Based Interventions

3.2.1 No-Right-Answer Probe

Fig 4 summarizes results on the No-Right-Answer probe using two key statistics. Before describing these statistics, we define some supporting measures. Specifically, let ANC and ANC’ respectively represent the average confidence of the model in the non-substituted options pre- and post-intervention. For benchmarks such as PIQA and aNLI, this is the confidence of the lone non-substituted option, since they only offer two choices per instance. For the other two benchmarks, where there are more than two choices, we average the confidences of all incorrect non-substituted choices. In contrast, RAC represents the confidence of the real correct option before the intervention, while SAC represents the confidence of the substituted option after swapping out the correct option (following the intervention).

Using the measures above, we can now define the two key statistics ANC’ − SAC and ANC − RAC. The former is a measure of how biased the model is toward non-substituted incorrect options compared to the substituted incorrect option. Note that the non-substituted options tend to have more surface similarity to the prompt (because of the way the benchmarks were conceived and designed, in order to be reasonably challenging for QA models), compared to the substituted option, even though they are all incorrect. The ANC − RAC measure is instead a pre-intervention measure, and measures the confidence gap of the model between the non-substituted choices and the original correct choice. Ideally, ANC − RAC should be -1.0, and we do indeed find that the model tends toward this number for almost all four benchmarks.

In contrast, for the ANC’ − SAC measure, the model should tend toward 0 when only the correctness of an option, rather than surface similarity to the prompt, matters. In fact, given the positive bias of ANC’ − SAC in Fig 4, it is clear that RoBERTa tends to choose one of the non-substituted options. Among the four benchmarks, RoBERTa obtained the highest ANC’ − SAC, as well as a relatively low ANC − RAC (and, consequently, the greatest difference between the ANC’ − SAC and ANC − RAC measures) on the aNLI benchmark. Interestingly, we observe higher standard errors for the pre-intervention ANC − RAC measure for aNLI, in contrast with that of PIQA, where the post-intervention spread is higher. Furthermore, for PIQA, the post-intervention and pre-intervention confidence spreads overlap much more than for other benchmarks.
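
The two confidence-difference statistics built from ANC, ANC’, RAC and SAC can be computed as in the sketch below. The array layout (one row of confidences per instance, with `labels` giving the position of the correct option before the intervention and of the substituted option after it) and the toy numbers are assumptions for illustration.

```python
import numpy as np

def no_right_answer_stats(pre, post, labels):
    """Return (mean of ANC' - SAC, mean of ANC - RAC).

    pre, post: (instances, choices) confidence arrays before / after the
    No-Right-Answer intervention. labels: index of the correct option (pre),
    which is also the index of the substituted option (post).
    """
    pre = np.asarray(pre, dtype=float)
    post = np.asarray(post, dtype=float)
    rows = np.arange(len(labels))
    mask = np.ones_like(pre, dtype=bool)
    mask[rows, labels] = False                       # True on non-substituted options
    n_other = mask.sum(axis=1)
    anc = (pre * mask).sum(axis=1) / n_other         # ANC, per instance
    anc_post = (post * mask).sum(axis=1) / n_other   # ANC', per instance
    rac = pre[rows, labels]                          # RAC
    sac = post[rows, labels]                         # SAC
    return float((anc_post - sac).mean()), float((anc - rac).mean())

# Hypothetical confidences for two three-choice instances.
pre = [[0.90, 0.05, 0.05], [0.10, 0.80, 0.10]]
post = [[0.20, 0.40, 0.40], [0.45, 0.10, 0.45]]
post_gap, pre_gap = no_right_answer_stats(pre, post, labels=[0, 1])
```

In this toy example the post-intervention gap is positive, meaning the non-substituted options receive more confidence than the substituted one, which is the pattern described for RoBERTa above.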

Perhaps the most interesting qualitative takeaway from the results is that surface similarity between prompt and answer can clearly matter a lot, even when the answer is wrong. For difficult questions or particularly creative (and correct) answers to such questions, this bias may prove to be problematic. As we note in Discussion, this problem also exists in the newer QA models that rely on billions of parameters to achieve better performance.

Figure 4: Mean of confidence differences (with error bars) for each benchmark before and after the No-Right-Answer intervention. ANC, ANC’, RAC and SAC are defined in the text. The error bars represent standard errors of confidence differences, ranging from 0.37 to 0.75.

3.2.2 Choice Paralysis Probe

Figure 5: The average confidence of RoBERTa in the correct option, with standard errors, following the Choice Paralysis intervention. Results are displayed for different values of the number of choices $n$ and for both choice-sampling methodologies (random and heuristic). The grey bar is the average confidence of the correct answer in the original dev. set (before the intervention). The error bars range from 0.01 to 0.07.
Figure 6: The hits@k performance of the model following the Choice Paralysis intervention. The dashed grey line in each sub-plot shows the original performance (accuracy) on each benchmark. As in Fig 5, $n$ represents the total number of choices following the intervention, and results are similarly illustrated for both choice-sampling methodologies.

Fig 5 illustrates our key findings when intervening using the Choice Paralysis probe. In principle, a language model like RoBERTa should be able to choose the correct answer no matter how many choices are provided per instance, and the confidences of correct answers should be close to their pre-intervention confidence. In practice, we expect some loss in both accuracy and confidence, but similar to the earlier probes, the extent of this loss can only be determined empirically. In Fig 5, however, we find, interestingly enough, that when the number of choices $n = 5$ and the sampling methodology is random, RoBERTa is able to achieve better performance on three of the four benchmarks (PIQA is the lone exception). This suggests that, when a random wrong answer is inserted as a choice, the model is better (with higher confidence) able to pick out the right answer on most of the benchmarks.

Heuristic sampling, which is more adversarial since we are deliberately trying to confuse the model with options that have a greater chance of being related to the prompt and to the other answers, leads to a significant decline in performance for all benchmarks except aNLI (where there is a decline, but it is not significant). In fact, the results suggest that aNLI is the ‘easiest’ benchmark for RoBERTa when applying this specific probe, since it is the only benchmark where it is able to stay within a 20% margin of its original performance even in the most aggressive setting ($n = 15$ choices, with heuristic sampling). Even so, the results clearly illustrate that even on this benchmark, RoBERTa is not immune from the choice paralysis problem indefinitely.

In general, except for $n = 5$ choices with random sampling, the result is an expected one across all four benchmarks: average post-intervention confidence on each of the four benchmarks shows a decreasing linear-like dependence on the number of choices per instance, when random sampling is used. When the number of choices $n$ goes from 5 to 10, and then 10 to 15, the model shows a similar decrease, on average, in the confidence of the correct option. Results for the heuristic sampling methodology are much more straightforward, with RoBERTa exhibiting confusion consistently and to a more extreme extent than with the random sampling (keeping $n$ fixed).

One aspect of the results in Fig 5 is that, as the number of options expands, only taking the confidence of the correct option into account (or using the accuracy metric) to understand the confusion properties of the language model may be too harsh. An alternate way to understand choice paralysis is to use the hits@k metric. This metric does not use the confidence directly, but instead ranks the options in decreasing order of confidence. With this ranking in place, and given a value for $k$, the hits@k for an instance is 1 if the correct answer falls within the top $k$ items on the ranked list, and 0 otherwise.
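
The hits@k metric described above can be written compactly as follows. This is a minimal sketch with hypothetical function names and made-up confidences; ties in confidence are broken by the original option order, which is an arbitrary choice of this example.

```python
def hits_at_k(confidences, label, k):
    """1 if the correct option is among the k highest-confidence options, else 0."""
    ranked = sorted(range(len(confidences)),
                    key=lambda i: confidences[i], reverse=True)
    return int(label in ranked[:k])

def mean_hits_at_k(batch, k):
    """Average hits@k over (confidences, label) pairs."""
    return sum(hits_at_k(conf, label, k) for conf, label in batch) / len(batch)

# Two hypothetical five-choice instances: (confidences, index of the correct option).
batch = [
    ([0.10, 0.50, 0.20, 0.15, 0.05], 2),
    ([0.60, 0.10, 0.10, 0.10, 0.10], 0),
]
hits1 = mean_hits_at_k(batch, k=1)
hits2 = mean_hits_at_k(batch, k=2)
```

With k = 1 the metric reduces to ordinary accuracy, so comparing hits@1 against hits@2 shows how often the correct answer is ranked just below the model's top prediction.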

Since this metric clearly depends on $k$, we vary $k$ for each benchmark and experimental setting (using the different values of the number of choices $n$ and the two sampling methodologies) and plot the results in Fig 6. Consistent with earlier findings in Fig 5, we find in Fig 6 that RoBERTa continues to perform reasonably well relative to its original performance (the dashed horizontal line in each sub-plot), after random sampling intervention, and its hits@2 accuracy is higher than the original accuracy on most benchmarks except when . PIQA is again an exception, just as in Fig 5. For PIQA, the model needs to output 3 predictions (after a 5-choice intervention) before the correct option is covered at the same rate as its original accuracy.

Once again, we find that the heuristic sampling methodology causes much more severe confusion for the model. RoBERTa starts to need more top ranked predictions, especially on PIQA and HellaSwag, to reach the original accuracy. On PIQA, the model may need to see more than half the options () before its hits@k accuracy equals the original accuracy. While not as extreme as PIQA, on HellaSwag, RoBERTa typically needs to make six predictions to maintain its original performance in the experimental setting.

An additional important point that bears noting is that, as we expand the number of options $n$, the time-cost for answering a prompt exhibits an increasing near-linear relationship in $n$. For example, on PIQA, the average time that RoBERTa took to answer 50 instances went from 30 minutes to 98 minutes when presented with 15 options, compared to when it was presented with 5 options. As we briefly note in Discussion, the time complexity can be more extreme with newer transformer-based models.

3.3 Confidence Calibration

Recent studies have suggested that the BERT-based models can be overly confident in their predictions Wallace et al. (2019); Ribeiro et al. (2018). We use MaxProb as a calibration technique to distinguish pre- and post-intervention instances Hendrycks and Gimpel (2016); Kamath et al. (2020); Zhang et al. (2021). As described below, MaxProb makes this distinction by using the probability assigned by the underlying multiple-choice NLI system to the most likely (i.e., highest-probability) prediction among the candidate choices. If it decides that an instance is post-intervention, it recommends that the NLI system abstain from answering. Previously, MaxProb was found to give good confidence estimates on multiple-choice benchmarks.

To train MaxProb, we randomly selected 50% of the instances from a benchmark’s dev. set, perturbed these instances using a probe (note that we only test the No-Question, Wrong-Question, and No-Right-Answer probes here), and used the mixture of original and perturbed instances as a training set. The average MaxProb over this set is treated as a threshold, with the other 50% of the instances forming an evaluation set. In the evaluation set, the instances are perturbed using the same probe. An instance is predicted as an original (i.e., non-perturbed) instance only if its MaxProb is higher than the learned threshold. The accuracy of MaxProb is defined as the proportion of instances correctly predicted as pre- or post-intervention instances.
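
The thresholding procedure above can be sketched as follows. The function names and toy confidences are hypothetical, and the sketch assumes that the evaluation set, like the training set, contains both original and perturbed versions of its instances; the text does not spell out this detail.

```python
import numpy as np

def max_prob(confidences):
    """Highest probability assigned to any candidate choice, per instance."""
    return np.asarray(confidences, dtype=float).max(axis=1)

def learn_threshold(train_original, train_perturbed):
    """Threshold = average MaxProb over the mixed training set."""
    scores = np.concatenate([max_prob(train_original), max_prob(train_perturbed)])
    return float(scores.mean())

def detection_accuracy(eval_original, eval_perturbed, threshold):
    """Fraction of instances correctly flagged as original (MaxProb above the
    threshold) or perturbed (MaxProb at or below the threshold)."""
    pred_orig = max_prob(eval_original) > threshold
    pred_pert = max_prob(eval_perturbed) > threshold
    correct = int(pred_orig.sum()) + int((~pred_pert).sum())
    return correct / (len(pred_orig) + len(pred_pert))

# Hypothetical two-choice confidences.
threshold = learn_threshold([[0.95, 0.05], [0.90, 0.10]],
                            [[0.60, 0.40], [0.55, 0.45]])
accuracy = detection_accuracy([[0.80, 0.20], [0.70, 0.30]],
                              [[0.50, 0.50], [0.85, 0.15]],
                              threshold)
```

In the toy data one perturbed instance still receives a MaxProb of 0.85, above the learned threshold, so it is wrongly treated as an original instance. This is the failure mode that prior bias produces: the model stays confident even when no correct answer exists.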

Table 2 shows the per-benchmark learned thresholds using different confusion probes. Social IQA had the lowest thresholds (among all benchmarks) on all three confusion probes; however, its average learned MaxProb threshold is still 0.7 or higher. (As mentioned before, RoBERTa should be equally confident about each of its candidate choices on a post-intervention instance if there is no prior bias. Hence, ideally, the MaxProb on a given benchmark should just be the reciprocal of the number of candidate choices per instance in that benchmark.) Because of the high threshold, MaxProb should help RoBERTa more easily abstain from answering post-intervention instances. However, when we used MaxProb to distinguish perturbed instances in the evaluation set, we found its accuracy to only be about 0.64 or below on most benchmarks (except on the SocialIQA No-Question probe). The evidence also suggests that the No-Right-Answer instances are the hardest cases for MaxProb, with the Wrong-Question instances being the easiest. Although we focused on MaxProb in these additional experiments, future work could replace MaxProb with other, more recently proposed selective prediction methods Kamath et al. (2020); Zhang et al. (2021).

Benchmark | No-Question | Wrong-Question | No-Right-Answer
aNLI | 0.88 (0.59) | 0.86 (0.64) | 0.92 (0.46)
HellaSwag | 0.82 (0.58) | 0.8 (0.61) | 0.8 (0.62)
PIQA | 0.76 (0.53) | 0.76 (0.55) | 0.81 (0.41)
SocialIQA | 0.70 (0.70) | 0.74 (0.60) | 0.76 (0.55)

Table 2: The learned MaxProb thresholds of different confusion probes on four benchmarks. The accuracy of MaxProb to distinguish pre- and post-intervention instances in each evaluation set is shown in brackets.

4 Discussion

While the results described in the previous section clearly indicate non-trivial, and often significant, evidence of both prior bias and choice paralysis, they do not shed much light on the potential causes of these issues. For example, is fine-tuning a major driver of prior bias, and are larger models that were released subsequent to BERT and RoBERTa equally prone to prior bias and choice paralysis? While causal questions are fundamentally difficult to address using only a limited set of benchmarks, models and interventions, we provide a discussion in this section using auxiliary analyses. Specifically, we aim to illustrate the potential impact of fine-tuning by evaluating other models, including a non-fine-tuned version of RoBERTa, and we also evaluate non-BERT-based models to understand if the problem is specific to BERT’s architecture. We also study if syntactic properties or ‘irregularities’ in the benchmarks themselves may have contributed to the issues.

4.1 Irregularities Analysis

The evidence from the experimental study suggests that specific benchmarks used to fine-tune the model have some effect on the bias; hence, we begin by conducting an ‘irregularities analysis’ i.e., by testing effects of label imbalance, the distribution of prompt-lengths, and the words-overlap between the prompt and the candidate answers, on our benchmarks, to confirm or exclude some causes for the bias.

Specifically, the label imbalance analysis aims at testing whether the first listed choice is labeled in a benchmark’s dev. set as correct with a higher-than-random probability. Statistically, we found that labels tend to be evenly distributed in all four benchmarks’ dev. sets, and hence, there is no label imbalance or ‘choice ordering’ bias of any kind. For example, in the 2-option PIQA benchmark, the first option is the correct answer in 49.5% (910 out of 1838) of the instances. In aNLI, the first option is the correct answer in 51% (781 out of 1532) of the instances. In the 3-option Social IQA benchmark, 33.4% and 33.6% of the instances are labeled with ‘answerA’ and ‘answerB’ as the correct answer, respectively, which again suggests near-random label distribution. Similarly, in the 4-option HellaSwag, the frequencies of the four options labeled as the correct answers are 25% (each).
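
A simple way to check the first-option counts quoted above against chance is a normal approximation to the binomial distribution. The text does not say which test was used, so the z-score below is only an illustrative assumption; the counts are the PIQA and aNLI figures given in the paragraph.

```python
import math

def first_option_z(first_correct, total, num_choices):
    """z-score of the observed count of 'first option is correct' against a
    uniform label distribution (normal approximation to the binomial)."""
    p0 = 1.0 / num_choices
    expected = total * p0
    std_dev = math.sqrt(total * p0 * (1.0 - p0))
    return (first_correct - expected) / std_dev

z_piqa = first_option_z(910, 1838, num_choices=2)
z_anli = first_option_z(781, 1532, num_choices=2)

# |z| below 1.96 means no evidence of imbalance at the 95% level.
balanced = abs(z_piqa) < 1.96 and abs(z_anli) < 1.96
```

Both z-scores fall well inside the 95 percent band, in line with the conclusion that choice ordering does not explain the observed bias.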

Similarly, we also calculated the average lengths (in terms of the number of words) of the candidate choices in the two sets comprising the selected (by the multiple-choice NLI system) and non-selected candidate choices to determine if there is some kind of a length bias. While we did find that the selected choices tended to be longer than the non-selected choices over all benchmarks, the difference was slight and not statistically significant. For example, in PIQA, the average length of selected candidate answers is 19.24 words (with 95% confidence interval of [18.38, 20.11]), while the average length of non-selected candidate answers is 18.43 words. In Social IQA, the average length of selected candidate answers is 3.72 words (with 95% confidence interval of [3.62, 3.82]), while the average length of non-selected candidate answers is 3.69 words. The difference was not found to be significant at the 95% confidence level.

Finally, we analyzed the words-overlap between the candidate choices and the prompt, by grouping all candidate choices into two sets (selected and non-selected), similar to the grouping employed in the analysis above. We found that most of the overlapping words between the prompts and choices (in either set) were stopwords. Indeed, when we calculated the Pearson’s correlation between the word frequency distributions over the selected and non-selected sets, the correlation was found to be close to 1. Social IQA achieved the lowest correlation value (0.982) between the two sets, while HellaSwag achieved the highest value (0.997).
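
The correlation between word-frequency distributions can be illustrated with the sketch below. Tokenization by lowercased whitespace splitting, counting all words rather than only those shared with the prompt, and the toy sentences are assumptions of this example; the text does not specify these details.

```python
from collections import Counter
import numpy as np

def word_freq_correlation(selected, non_selected):
    """Pearson correlation between word-frequency vectors of two sets of
    candidate choices, computed over their shared vocabulary."""
    freq_a = Counter(w for text in selected for w in text.lower().split())
    freq_b = Counter(w for text in non_selected for w in text.lower().split())
    vocab = sorted(set(freq_a) | set(freq_b))
    x = np.array([freq_a[w] for w in vocab], dtype=float)
    y = np.array([freq_b[w] for w in vocab], dtype=float)
    return float(np.corrcoef(x, y)[0, 1])

# Hypothetical candidate choices.
same = word_freq_correlation(["the cat sat on the mat", "the dog"],
                             ["the cat sat on the mat", "the dog"])
different = word_freq_correlation(["the cat sat on the mat"],
                                  ["the dog ran to the park"])
```

Identical sets give a correlation of 1, while sets that share only stopwords give a much lower value, so a correlation close to 1 between the selected and non-selected sets indicates that the model's picks are not distinguished by their vocabulary.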

Taken together, these results suggest that the surface irregularities in benchmarks likely do not explain the prior bias that we observed in our experiments. While a full causal analysis of prior bias is difficult to determine experimentally without a larger set of controls (comprising both carefully constructed benchmarks and a broader range of perturbation functions and confusion probes), at least one potential cause could be the hidden patterns in the datasets used for fine-tuning. For example, it has been found that ‘annotation artifacts’ Gururangan et al. (2018) can be introduced unintentionally by crowd workers constructing the benchmark (by devising hypotheses and candidate choices). Another hypothetical cause is that models may have picked up the bias during pre-training (e.g., due to increased frequency of some terms). A complete analysis of these hypothetical causes is left for future research.

4.2 Additional Experiments Using UnifiedQA

A major advance in the last few years has been the publication and release of universal ‘cross-format’ multiple-choice NLI language models that have improved performance on various benchmarks without requiring fine-tuning per benchmark, as the RoBERTa model used in this paper required. Of these models, many of the latest ones designed for NLI are based on T5-11B (e.g., unifiedQA, Khashabi et al. (2020)). To verify the generalization of our findings for some of these newer models, we conducted preliminary experiments using pre-trained unifiedQA to replicate some of the findings described earlier using all four confusion probes.

Despite being trained already, the unifiedQA T5-11B model takes much longer to answer prompts due to the large size of its parameter space (in the billions). Therefore, in our preliminary experiments, we repeatedly sampled 50 instances three times from the dev. set of each benchmark (for each intervention), and averaged the results to obtain stable performance estimates. The average original accuracy of unifiedQA on the four benchmarks is found to be 0.62. Considering a single model was used and no fine-tuning was done, this is a high performance compared to previous models tested with similar constraints. Comparative performance depends on the benchmark: unifiedQA achieved similar performance (0.75) on PIQA, had a slight performance decrease (0.77) on aNLI, but achieved lower performance on SocialIQA (0.52) and HellaSwag (0.42).

When applying the No-Question probe, unifiedQA achieved a near-random performance on SocialIQA (0.38), just as RoBERTa had earlier. Encouragingly, unifiedQA had a more significant decrease in its performance on PIQA (the average ‘pseudo-accuracy’ is 0.58), compared to RoBERTa (0.73). This may be a sign of progress in the NLI model’s development, but the results would need to be replicated in the future for the full dev. set, and not just the 50-instance sample (on which we averaged performance across three independent trials). On aNLI, unifiedQA obtained a similar ‘pseudo-accuracy’ (0.59) compared to RoBERTa (0.64). On HellaSwag, unifiedQA’s pseudo-accuracy decreased to 0.34, compared to RoBERTa’s post-intervention pseudo-accuracy of 0.66. While the tendency is similar, unifiedQA achieved lower original performance on HellaSwag, and the percentage decline in unifiedQA’s pseudo-accuracy compared to the original accuracy is much larger. This is again a promising sign but significant prior bias clearly still exists, although less than RoBERTa.

When applying the Wrong-Question probe, unifiedQA achieved a lower pseudo-accuracy compared to the No-Question probe on HellaSwag and SocialIQA, but a higher pseudo-accuracy on PIQA and aNLI. The largest decrease was observed on HellaSwag, where the ‘pseudo-accuracy’ changed from 0.34 (No-Question probe) to 0.28 (Wrong-Question probe). On SocialIQA, there is a 0.04 decrease in unifiedQA’s pseudo-accuracy compared to the No-Question probe. On PIQA, however, the ‘pseudo-accuracy’ increases from 0.58 to 0.68, and on aNLI, the ‘pseudo-accuracy’ increases by 0.04. Therefore, we again start to see benchmark-specific divergence in the behavior of a model.

When applying the No-Right-Answer probe, we verified that unifiedQA also preferred the non-substituted options (which tend to be more contextually related to the prompt, even though they are incorrect choices), similar to what RoBERTa did, especially on PIQA. The probability of RoBERTa choosing the substituted option on PIQA is 0.37, but for unifiedQA, the probability drops to 0.26. On SocialIQA and HellaSwag, the two models behaved similarly for this probe. However, on aNLI, the ‘pseudo-accuracy’ increases from 0.13 for RoBERTa to 0.24 for unifiedQA. Once again, we observe benchmark-specific divergence. Overall, unifiedQA has a higher preference for semantically related (to the prompt), but still incorrect, answers compared to the fine-tuned RoBERTa.

Finally, when applying the Choice Paralysis probe, the average accuracy of unifiedQA on each benchmark is near-random. This suggests, unlike the question-based interventions, that there is considerable room for improvement still since unifiedQA is obviously susceptible to choice paralysis. Following the 5-choice intervention, the accuracy of unifiedQA averaged across the four benchmarks is around 0.14, much lower than RoBERTa’s near-original performance. Combined with the fact that unifiedQA exhibited disappointing performance on HellaSwag, the model may not be as capable of handling instances that have too many long-length choices, compared to instances and benchmarks that provide a relatively limited set of (2-3) short-length options.

4.3 Additional Experiments Using XLNet

XLNet (Yang et al. (2019)) is an extension of the Transformer-XL model. It is pre-trained using an autoregressive method to learn bidirectional contexts. Under comparable experiment settings, XLNet was found to out-perform BERT on 20 tasks. Hence, we selected the pre-trained XLNet as another efficient model to conduct preliminary experiments using all four confusion probes. The version of XLNet we used is ‘xlnet-large-cased’, provided by Wolf et al. (2020).

Compared to unifiedQA, XLNet answers questions much faster; hence, we did not need to sample instances, but were able to replicate the experiments by just inputting all instances in the dev. set to XLNet. We found that, unlike unifiedQA, the non-fine-tuned XLNet only achieved random performance on the original dev. sets on all benchmarks (0.51 on aNLI, 0.52 on PIQA, 0.34 on Social IQA, and 0.24 on HellaSwag). After applying the No-Question, Wrong-Question, and No-Right-Answer probes, the ‘pseudo-accuracy’ of XLNet on post-intervention instances remains random. The ideal result on the post-intervention instances is hence achieved at the cost of much lower performance on the pre-intervention instances, which is the inverse of what was observed for the equivalent fine-tuned RoBERTa model. When we implemented the 5-choice intervention, the average accuracy of XLNet across all benchmarks decreased to 0.17. While being slightly higher than unifiedQA, the performance is still lower than random.

4.4 Additional Experiments Using Pre-trained RoBERTa

To understand the dependencies between fine-tuning and the extent of prior bias and choice paralysis in RoBERTa, we also repeated these experiments using just the pre-trained RoBERTa model (i.e., without fine-tuning). We used a pre-trained RoBERTa model with a facility for multiple-choice classification, provided by Wolf et al. (2020). Similar to the pre-trained XLNet, the pre-trained RoBERTa was also found to have near-random accuracy on the original dev. sets of all four benchmarks (0.51 on aNLI, 0.50 on PIQA, 0.34 on Social IQA, and 0.28 on HellaSwag). Additionally, we found that, after applying the No-Question, Wrong-Question, and No-Right-Answer probes, the pre-trained RoBERTa model’s ‘pseudo-accuracy’ remained random. Interestingly, when we implemented the 5-choice intervention, the average accuracy of RoBERTa was 0.24, which is slightly higher than random (0.2). On select benchmarks, such as aNLI, the accuracy was as high as 0.3 (although still much lower than the ideal of 1.0). This result again supports the previous claim that aNLI is the easiest post-intervention benchmark for RoBERTa following application of the choice paralysis probe.
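The 5-choice averages reported for the three additional models can be placed side by side against the 0.2 chance level. The sketch below only restates the numbers given in the text; the helper name `relative_to_chance` and the dictionary keys are illustrative assumptions, and no statistical significance is implied because per-benchmark sample sizes are not used here.

```python
# Chance level for a 5-choice instance (uniform random guessing).
FIVE_CHOICE_CHANCE = 1.0 / 5

# Average accuracy after the 5-choice intervention, as reported in the text.
FIVE_CHOICE_AVG_ACC = {
    "unifiedQA": 0.14,
    "XLNet (pre-trained)": 0.17,
    "RoBERTa (pre-trained)": 0.24,
}


def relative_to_chance(acc: float, chance: float = FIVE_CHOICE_CHANCE,
                       tol: float = 1e-9) -> str:
    """Label an accuracy as 'above', 'below', or 'at' the chance level."""
    if acc > chance + tol:
        return "above"
    if acc < chance - tol:
        return "below"
    return "at"


summary = {model: relative_to_chance(acc) for model, acc in FIVE_CHOICE_AVG_ACC.items()}

if __name__ == "__main__":
    for model, label in summary.items():
        diff = FIVE_CHOICE_AVG_ACC[model] - FIVE_CHOICE_CHANCE
        print(f"{model}: {label} chance ({diff:+.2f})")
```

Only the pre-trained RoBERTa model ends up above the chance level in this comparison, consistent with the observation above.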

5 Summary

In this paper, we proposed to study the statistical prevalence of prior bias and choice paralysis in a popular language representation model based on BERT that has been extensively used in the commonsense reasoning community. Our methodology and experiments rely on publicly available benchmarks and a publicly available multiple-choice NLI model, neither of which the authors had any role in developing or disseminating. We found evidence for both phenomena, although the extent of each phenomenon depends on the dataset and (in the case of choice paralysis) the experimental parameters used in the intervention.

Further analysis, including an ‘irregularities analysis’, suggests that these are complex phenomena that are not simply caused by surface irregularities or artifacts in the prompts and choices. Interestingly, evidence of these phenomena is observed even in a more recent and independently pre-trained model, such as unifiedQA. Additional analyses reported in the Discussion section, such as those using the pre-trained RoBERTa model (without fine-tuning) and another transformer-based model called XLNet, show that these models achieve the ideal bias-free performance (on perturbed instances), but at the cost of significantly lower performance on the original dev. set. This is the inverse of what was observed for the equivalent fine-tuned model: the benchmark-specific fine-tuned model achieved excellent performance on the unperturbed instances in the dev. set of the benchmark, but then exhibited prior bias on the perturbed instances.

In the future, we plan to extend the methodology to conduct detailed robustness studies of other linguistic phenomena and biases. Such studies provide important insights into the workings and biases of transformer-based models, even as they become more complex and widespread.

This work was funded under the DARPA Machine Common Sense program.


  • A. Adhikari, A. Ram, R. Tang, and J. Lin (2019) Docbert: bert for document classification. arXiv preprint arXiv:1904.08398. Cited by: §1.
  • E. Alsentzer, J. R. Murphy, W. Boag, W. Weng, D. Jin, T. Naumann, and M. McDermott (2019) Publicly available clinical bert embeddings. arXiv preprint arXiv:1904.03323. Cited by: §1.
  • S. Balasubramanian, N. Jain, G. Jindal, A. Awasthi, and S. Sarawagi (2020) What’s in a name? are bert named entity representations just as good for any other name?. arXiv preprint arXiv:2007.06897. Cited by: §1.
  • I. Beltagy, K. Lo, and A. Cohan (2019) SciBERT: a pretrained language model for scientific text. arXiv preprint arXiv:1903.10676. Cited by: §1.
  • L. Bentivogli, P. Clark, I. Dagan, and D. Giampiccolo (2009) The fifth pascal recognizing textual entailment challenge.. In TAC, Cited by: §1.
  • C. Bhagavatula, R. L. Bras, C. Malaviya, K. Sakaguchi, A. Holtzman, H. Rashkin, D. Downey, W. Yih, and Y. Choi (2020) Abductive commonsense reasoning. In International Conference on Learning Representations, Cited by: §1.1, item 1.
  • C. Bhagavatula, R. L. Bras, C. Malaviya, K. Sakaguchi, et al. (2019) Abductive natural language inference (anli). Note: Cited by: §1.1, item 1.
  • Y. Bisk, R. Zellers, R. L. Bras, J. Gao, and Y. Choi (2020a) PIQA: reasoning about physical commonsense in natural language. In Thirty-Fourth AAAI Conference on Artificial Intelligence, Cited by: §1.1, item 3.
  • Y. Bisk, R. Zellers, J. Gao, Y. Choi, et al. (2020b) Physical iqa. Note: Cited by: §1.1, item 3.
  • I. Dagan, O. Glickman, and B. Magnini (2005) The pascal recognising textual entailment challenge. In Machine Learning Challenges Workshop, pp. 177–190. Cited by: §1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1.
  • A. Ettinger (2020) What bert is not: lessons from a new suite of psycholinguistic diagnostics for language models. Transactions of the Association for Computational Linguistics 8, pp. 34–48. Cited by: §1.
  • A. Fan, Y. Jernite, E. Perez, D. Grangier, J. Weston, and M. Auli (2019) Eli5: long form question answering. arXiv preprint arXiv:1907.09190. Cited by: §1.
  • Z. Gao, A. Feng, X. Song, and X. Wu (2019) Target-dependent sentiment classification with bert. IEEE Access 7, pp. 154290–154299. Cited by: §1.
  • D. Giampiccolo, B. Magnini, I. Dagan, and W. B. Dolan (2007) The third pascal recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL workshop on textual entailment and paraphrasing, pp. 1–9. Cited by: §1.
  • S. Gururangan, S. Swayamdipta, O. Levy, R. Schwartz, S. R. Bowman, and N. A. Smith (2018) Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324. Cited by: §1.1, §4.1.
  • R. B. Haim, I. Dagan, B. Dolan, L. Ferro, D. Giampiccolo, B. Magnini, and I. Szpektor (2006) The second pascal recognising textual entailment challenge. In Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment, Cited by: §1.
  • D. Hendrycks and K. Gimpel (2016) A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136. Cited by: §3.3.
  • L. Hirschman and R. Gaizauskas (2001) Natural language question answering: the view from here. Natural Language Engineering 7 (4), pp. 275. Cited by: §1.
  • S. Iyer, N. Dandekar, and K. Csernai (2016) First quora dataset release: question pairs. Note: https:// Cited by: §1.
  • G. Jawahar, B. Sagot, and D. Seddah (2019) What does bert learn about the structure of language?. In ACL 2019-57th Annual Meeting of the Association for Computational Linguistics, Cited by: §1.
  • D. Kahneman (2011) Thinking, fast and slow. Macmillan. Cited by: §1.
  • A. Kamath, R. Jia, and P. Liang (2020) Selective question answering under domain shift. arXiv preprint arXiv:2006.09462. Cited by: §3.3, §3.3.
  • D. Kaushik and Z. C. Lipton (2018) How much reading does reading comprehension require? a critical investigation of popular benchmarks. arXiv preprint arXiv:1808.04926. Cited by: §1.1.
  • D. Khashabi, S. Min, T. Khot, A. Sabharwal, O. Tafjord, P. Clark, and H. Hajishirzi (2020) Unifiedqa: crossing format boundaries with a single qa system. arXiv preprint arXiv:2005.00700. Cited by: §4.2.
  • G. Kobayashi, T. Kuribayashi, S. Yokoi, and K. Inui (2020) Attention module is not only a weight: analyzing transformers with vector norms. arXiv preprint arXiv:2004.10102. Cited by: §1.
  • G. Lample and A. Conneau (2019) Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291. Cited by: §1.
  • Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut (2019) Albert: a lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942. Cited by: §1.
  • J. Lee and J. Hsiang (2019) Patentbert: patent classification with fine-tuning a pre-trained bert model. arXiv preprint arXiv:1906.02124. Cited by: §1.
  • J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, and J. Kang (2020) BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36 (4), pp. 1234–1240. Cited by: §1.
  • R. Lenz and M. A. Lyles (1985) Paralysis by analysis: is your planning system becoming too rational?. Long Range Planning 18 (4), pp. 64–72. Cited by: §1.
  • P. Lewis, B. Oğuz, R. Rinott, S. Riedel, and H. Schwenk (2019) Mlqa: evaluating cross-lingual extractive question answering. arXiv preprint arXiv:1910.07475. Cited by: §1.
  • N. F. Liu, M. Gardner, Y. Belinkov, M. E. Peters, and N. A. Smith (2019a) Linguistic knowledge and transferability of contextual representations. arXiv preprint arXiv:1903.08855. Cited by: §1.
  • W. Liu, P. Zhou, Z. Zhao, Z. Wang, Q. Ju, H. Deng, and P. Wang (2020) K-bert: enabling language representation with knowledge graph. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 2901–2908. Cited by: §1.
  • Y. Liu and M. Lapata (2019) Text summarization with pretrained encoders. arXiv preprint arXiv:1908.08345. Cited by: §1.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019b) Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §1, §2.2.
  • J. Lu, D. Batra, D. Parikh, and S. Lee (2019) Vilbert: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv preprint arXiv:1908.02265. Cited by: §1.
  • R. T. McCoy, E. Pavlick, and T. Linzen (2019) Right for the wrong reasons: diagnosing syntactic heuristics in natural language inference. arXiv preprint arXiv:1902.01007. Cited by: §1.
  • M. Munikar, S. Shakya, and A. Shrestha (2019) Fine-grained sentiment classification using bert. In 2019 Artificial Intelligence for Transforming Business and Society (AITB), Vol. 1, pp. 1–5. Cited by: §1.
  • A. Poliak, J. Naradowsky, A. Haldar, R. Rudinger, and B. Van Durme (2018) Hypothesis only baselines in natural language inference. arXiv preprint arXiv:1805.01042. Cited by: §1.1.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250. Cited by: §1.
  • S. Reddy, D. Chen, and C. D. Manning (2019) Coqa: a conversational question answering challenge. Transactions of the Association for Computational Linguistics 7, pp. 249–266. Cited by: §1.
  • M. T. Ribeiro, S. Singh, and C. Guestrin (2018) Semantically Equivalent Adversarial Rules for Debugging NLP models. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 856–865. Cited by: §3.3.
  • M. T. Ribeiro, T. Wu, C. Guestrin, and S. Singh (2020) Beyond accuracy: behavioral testing of nlp models with checklist. arXiv preprint arXiv:2005.04118. Cited by: §1.
  • K. Richardson, H. Hu, L. Moss, and A. Sabharwal (2020) Probing natural language inference models through semantic fragments. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 8713–8721. Cited by: §1.
  • A. Rogers, O. Kovaleva, and A. Rumshisky (2020) A primer in bertology: what we know about how bert works. Transactions of the Association for Computational Linguistics 8, pp. 842–866. Cited by: §1, §1.
  • V. Sanh, L. Debut, J. Chaumond, and T. Wolf (2019) DistilBERT, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108. Cited by: §1.
  • M. Sap, H. Rashkin, D. Chen, R. L. Bras, and Y. Choi (2019a) Social iqa: commonsense reasoning about social interactions. In EMNLP 2019, Cited by: §1.1, item 4.
  • M. Sap, H. Rashkin, D. Chen, R. LeBras, and Y. Choi (2019b) Social iqa. Note: Cited by: §1.1, item 4.
  • B. Schwartz (2004) The paradox of choice: why more is less. Cited by: §1.
  • K. Shen and M. Kejriwal (2021) On the generalization abilities of fine-tuned commonsense language representation models. In International Conference on Innovative Techniques and Applications of Artificial Intelligence, pp. 3–16. Cited by: §1.
  • J. Shin, Y. Lee, and K. Jung (2019) Effective sentence scoring method using bert for speech recognition. In Asian Conference on Machine Learning, pp. 1081–1093. Cited by: §1.
  • W. Siblini, C. Pasqual, A. Lavielle, and C. Cauchois (2019) Multilingual question answering from formatted text applied to conversational agents. arXiv preprint arXiv:1910.04659. Cited by: §1.
  • C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid (2019) Videobert: a joint model for video and language representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7464–7473. Cited by: §1.
  • T. Thongtan and T. Phienthrakul (2019) Sentiment classification using document embeddings trained with cosine similarity. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pp. 407–414. Cited by: §1.
  • E. Wallace, S. Feng, N. Kandpal, M. Gardner, and S. Singh (2019) Universal Adversarial Triggers for Attacking and Analyzing NLP. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 2153–2162. Cited by: §3.3.
  • E. Wallace, Y. Wang, S. Li, S. Singh, and M. Gardner (2019) Do nlp models know numbers? probing numeracy in embeddings. arXiv preprint arXiv:1909.07940. Cited by: §1.
  • L. L. Wang, K. Lo, Y. Chandrasekhar, R. Reas, J. Yang, D. Eide, K. Funk, R. Kinney, Z. Liu, W. Merrill, et al. (2020) Cord-19: the covid-19 open research dataset. ArXiv. Cited by: §1.
  • Y. Wang, W. Rong, Y. Ouyang, and Z. Xiong (2019) Augmenting dialogue response generation with unstructured textual knowledge. IEEE Access 7, pp. 34954–34963. Cited by: §1.
  • A. Warstadt and S. R. Bowman (2020) Can neural networks acquire a structural bias from raw linguistic data?. arXiv preprint arXiv:2007.06761. Cited by: §1.
  • A. Williams, N. Nangia, and S. R. Bowman (2017) A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426. Cited by: §1.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush (2020) Transformers: state-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, pp. 38–45. Cited by: §4.3, §4.4.
  • X. Wu, T. Zhang, L. Zang, J. Han, and S. Hu (2019) "Mask and infill": applying masked language model to sentiment transfer. arXiv preprint arXiv:1908.08039. Cited by: §1.
  • Z. Wu, Y. Chen, B. Kao, and Q. Liu (2020) Perturbed masking: parameter-free probing for analyzing and interpreting bert. arXiv preprint arXiv:2004.14786. Cited by: §1.
  • Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le (2019) Xlnet: generalized autoregressive pretraining for language understanding. Advances in neural information processing systems 32. Cited by: §4.3.
  • R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi (2019a) HellaSwag: can a machine really finish your sentence?. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Cited by: §1.1, item 2.
  • R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi (2019b) HellaSwag. Note: Cited by: §1.1, item 2.
  • S. Zhang, C. Gong, and E. Choi (2021) Knowing more about questions can help: improving calibration in question answering. arXiv preprint arXiv:2106.01494. Cited by: §3.3, §3.3.
  • X. Zhang, F. Wei, and M. Zhou (2019a) HIBERT: document level pre-training of hierarchical bidirectional transformers for document summarization. arXiv preprint arXiv:1905.06566. Cited by: §1.
  • Y. Zhang, S. Sun, M. Galley, Y. Chen, C. Brockett, X. Gao, J. Gao, J. Liu, and B. Dolan (2019b) Dialogpt: large-scale generative pre-training for conversational response generation. arXiv preprint arXiv:1911.00536. Cited by: §1.