On the Limitations of Dataset Balancing: The Lost Battle Against Spurious Correlations

by Roy Schwartz et al.
Hebrew University of Jerusalem

Recent work has shown that deep learning models in NLP are highly sensitive to low-level correlations between simple features and specific output labels, leading to overfitting and lack of generalization. To mitigate this problem, a common practice is to balance datasets by adding new instances or by filtering out "easy" instances (Sakaguchi et al., 2020), culminating in a recent proposal to eliminate single-word correlations altogether (Gardner et al., 2021). In this opinion paper, we identify that despite these efforts, increasingly-powerful models keep exploiting ever-smaller spurious correlations, and as a result even balancing all single-word features is insufficient for mitigating all of these correlations. In parallel, a truly balanced dataset may be bound to "throw the baby out with the bathwater" and miss important signal encoding common sense and world knowledge. We highlight several alternatives to dataset balancing, focusing on enhancing datasets with richer contexts, allowing models to abstain and interact with users, and turning from large-scale fine-tuning to zero- or few-shot setups.




1 Introduction

Effective human communication relies on our ability to understand extra-textual context based on common sense, world knowledge, or shared cultural experiences, a property often cited as Grice’s second maxim of quantity: “Do not make your contribution more informative than is required” (Grice, 1975, 1989). Studies have estimated that only 12% of the information conveyed by text is mentioned explicitly (Graesser, 2013; Tandon et al., 2020). To illustrate this, consider the question “who is the president of the U.S.?”. To answer it, a human reader is likely to presume many unstated propositions, as exemplified in Tab. 1.

Figure 1: A high-level overview of the current state of supervised NLP research. Dataset developers create more aggressive filtering techniques (left), leading to larger models that are able to solve them by finding more elusive spurious correlations (right).
Who is the president of the U.S.?

Context                   Answer
(none)                    Joe Biden
The year 2019             Donald Trump
The West Wing, season 1   Josiah “Jed” Bartlet

Table 1: Context, whether explicit or implicit, matters in textual understanding, as exemplified by the question “who is the president of the U.S.?”. E.g., in the first line, given no other context, a QA system should provide the most sensible fallback answer (Joe Biden, at the time of writing).

In contrast to humans, supervised models often fail to generalize and understand implicit context, instead resorting to low-level correlations in the data, leading to amplified bias (Zhao et al., 2017; Stanovsky et al., 2019) and brittle performance (Schwartz et al., 2017; Gururangan et al., 2018). To address this, recent approaches have suggested mitigating such correlations by balancing the dataset via either adding or removing certain instances Goyal et al. (2017); Hudson and Manning (2019); Zellers et al. (2018); Sakaguchi et al. (2020). In parallel, developers keep building larger and larger pretrained models Devlin et al. (2019); Liu et al. (2019); Raffel et al. (2020), which, when fine-tuned on these datasets, consistently manage to reach human performance. Taken together, these trends lead to an arms-race between data curation and model development (Fig. 1).

In this position paper, we question the value of mitigating spurious correlations via dataset balancing, by showing that their existence in large training sets is both inevitable and to some extent even desired, as they are an inherent property of natural language understanding. We build on a recent result by Gardner et al. (2021), who assumed that every single-word feature correlation is spurious, i.e., can be used to mislead a model. We extend their argument, showing that balancing single-word features is insufficient for eliminating all spurious correlations, and that balancing feature combination is needed for that purpose. On the other hand, we show that balancing too much leads to datasets that contain no learnable signal either. We conclude by questioning whether mitigating all spurious correlations via dataset balancing is practical.

Following, we show that this practice is also undesired. We show that ignoring these correlations will hinder the learning of fallback options for both world knowledge facts (Joe Biden is the president of the U.S.) and common sense knowledge (a person is happy when receiving a gift), thus preventing models from using this knowledge in cases of uncertainty. We conclude that the existence of spurious correlations in training sets should not be solved by creating more balanced datasets.[1] We emphasize that balancing methods are still useful as they can lead to mitigation of some spurious correlations, and therefore better generalization Le Bras et al. (2020); Swayamdipta et al. (2020), as well as potentially more efficient training. We argue that these methods are inherently limited in their ability to mitigate all spurious correlations.

We then discuss alternatives to mitigating spurious correlations. We argue that models should be trained to understand constructions emanating from an apriori theory of language, such as negation, sarcasm, humor, and metaphors. We also suggest adopting modeling approaches that identify when the context is insufficient. We argue that in such cases, the model should not fallback to default assumptions, but rather abstain or interact with the user to clear ambiguities. Finally, we question the basic procedure of large-scale fine-tuning, and suggest focusing on zero- and few-shot learning instead Liu et al. (2021b).

2 Dataset-Model Arms Race

This section provides a view of recent research in NLP as an arms race between models and datasets. Below we describe the conditions leading to this arms race, and present our main research question, challenging its value for making progress in NLP.

Models exploit spurious correlations

While pretrained models consistently perform well across multiple tasks, various studies have pointed out that this is often achieved by exploiting spurious correlations in datasets, rather than improving on the underlying task Glockner et al. (2018); Gururangan et al. (2018); Elazar et al. (2021), and that this phenomenon becomes more prominent as the models grow in size Li et al. (2021).

Mitigating spurious correlations via balancing

Various dataset curators have tried to prevent models from learning spurious correlations by modifying their training data via careful control of the training label distribution, effectively striving for a balanced dataset. One approach is to add examples in order to balance the dataset Goyal et al. (2017); Sharma et al. (2018); Hudson and Manning (2019). For instance, the VQA2.0 dataset Goyal et al. (2017) is built by taking every (question, image, answer) triplet in the VQA dataset (Antol et al., 2015), and adding another triplet with the same question but a different image, guaranteed to lead to a different answer. See Fig. 2 for an example.

Figure 2: An example of dataset balancing (adapted from Goyal et al., 2017). For each (question, image) pair in the VQA dataset (left), VQA2.0 adds another image, for which the answer is different (right).

Filtering as balancing

A complementary balancing approach to augmentation is filtering examples out from datasets such that spurious correlations are minimized. This approach was taken in the creation of the SWAG dataset Zellers et al. (2018), using “adversarial filtering” (AF). In AF, dataset instances that are easily solved by an adversarial model are filtered out. The AF approach and similar approaches were picked up by many datasets such as ReCoRD Zhang et al. (2018), DROP (Dua et al., 2019), HellaSWAG Zellers et al. (2019), αNLI Bhagavatula et al. (2020), and WinoGrande Sakaguchi et al. (2020).

Here we argue that approaches like AF converge to removing all low-level correlations, and therefore to a fully balanced dataset.[2] Indeed, AFLite, an extension of AF, was designed to “systematically discover and filter any dataset artifact in crowdsourced commonsense problems” (Le Bras et al., 2020, emphasis in the original). As this approach relies on an external model, applying it with ever-stronger, higher-capacity models will allow these models to pick up on subtler correlations Li et al. (2021). At the extreme, the remaining instances that could not be solved by a fully capable model will have no statistical signal that can be exploited by that model, i.e., a balanced dataset. We henceforth refer to both augmentation and filtering as balancing methods.

Large models solve the new datasets

In parallel to the efforts in dataset balancing, the leading modeling approach in NLP in recent years is pretraining large language models on raw text corpora, followed by fine-tuning them on supervised downstream applications. These models continue to grow in size (Peters et al., 2018; Devlin et al., 2019; Liu et al., 2019; Radford et al., 2019; Raffel et al., 2020), and their fine-tuning performance improves accordingly. This in turn leads to more aggressive balancing, setting in motion a kind of arms race between datasets and models (Fig. 1).

Evidently, a similar trend emerges for the previously mentioned datasets: (1) the first baselines, reflecting the state of the art at the time of dataset creation, perform relatively poorly, e.g., 59% on SWAG, 47% on ReCoRD, 47 F1 on DROP, 47% on HellaSWAG, 69% on αNLI, and 79% on WinoGrande; (2) model developers introduce increasingly larger and heavily-parameterized models, hill-climbing on these datasets; (3) models essentially solve the dataset within a year or two, often outperforming humans: 86% on SWAG (Devlin et al., 2019), 94% on ReCoRD (He et al., 2021b), 88 F1 on DROP (Chen et al., 2020), 93% on HellaSWAG (He et al., 2021b), 92% on αNLI (He et al., 2021a), and 90% on WinoGrande (Raffel et al., 2020); and (4) new large-scale datasets are collected with more aggressive pruning techniques, thus repeating the cycle.

Based on these findings, our main research question is whether dataset balancing is the most promising method for mitigating spurious correlations. We note that an arms race between models and datasets might spur advances. Here we question a specific aspect of this arms race: the improvement of datasets by using more aggressive filtering techniques. Next we turn to present practical and conceptual limitations of this practice.

3 The Lost Battle Against Spurious Correlations

Name             Description
ingenuine        Correlations between features and output labels for no reason.
ungeneralizable  Correlations that do not generalize to new contexts.
every-word       Correlations between every single-word feature and output label.

Table 2: Different definitions of spurious correlations.

So far we have identified dataset balancing as a common way to mitigate spurious correlations. Next, we outline how different works define spurious correlations (Sec. 3.1), and then question whether dataset balancing is a viable way for mitigating them; we note that balancing too little is bound to leave spurious correlations in the data (Sec. 3.2), while balancing too much discards meaningful signal (Sec. 3.3). We finish by questioning whether this practice is even desired (Sec. 3.4).

3.1 What are Spurious Correlations?

Mitigating spurious correlations is frequently used as motivation for developing new balancing approaches. However, the term spurious correlations is often not clearly and consistently defined. The basic definition is a set of features that are correlated but not causally related.[3] https://en.wikipedia.org/wiki/Spurious_relationship

In NLP, several definitions of spurious correlations are typically used. One conceptual definition, denoted here ingenuine (e.g., Wang and Culotta, 2020; Rogers, 2021), is a feature correlated with some output label for no apparent reason. Such features often result from the annotation process (referred to as annotation artifacts; Gururangan et al., 2018). For instance, Gururangan et al. (2018) have shown that the words “cat” and “sleeping” are correlated with contradictions in the SNLI dataset Bowman et al. (2015).

This definition is appealing: we want our models to learn real information about the world, and not properties of a given dataset. However, it is also somewhat subjective, and could include features that might be referred to as genuine, such as the word “not” indicating NLI contradictions. Further, genuine features, i.e., those representing a real phenomenon in the world (e.g., “amazing” as a feature for positive sentiment), are also likely to lead models to make erroneous predictions in some contexts (e.g., negation or sarcasm; Gardner et al., 2021). Such features could thus harm generalization, so some might consider them spurious as well.[4] See Eisenstein (2022) for discussion of different feature types.

In an alternative definition, denoted ungeneralizable, a spurious feature is one that works well for specific examples but does not hold in general Chang et al. (2021); Yaghoobzadeh et al. (2021). This definition does not address the nature of the feature (genuine or not), but does make an implicit assumption that such features are of high importance (e.g., high pointwise mutual information values with the corresponding label; Gururangan et al., 2018). This definition is no longer subjective in terms of the genuineness of the feature, but is still subjective in the level of effect on generalizability (i.e., what is a high value of PMI?).
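
To make the PMI criterion concrete, here is a minimal sketch of the computation (the toy data below are illustrative, not drawn from SNLI, and the cutoff for a “high” value remains exactly the subjective choice noted above):

```python
import math

def pmi(examples, feature, label):
    """Pointwise mutual information between a feature and a label:
    PMI = log2( p(feature, label) / (p(feature) * p(label)) )."""
    n = len(examples)
    p_f = sum(feature in feats for feats, _ in examples) / n
    p_l = sum(lab == label for _, lab in examples) / n
    p_fl = sum(feature in feats and lab == label
               for feats, lab in examples) / n
    if p_fl == 0:
        return float("-inf")
    return math.log2(p_fl / (p_f * p_l))

# Illustrative toy data: "cat" mostly co-occurs with the contradiction
# label, mimicking an annotation artifact.
data = [
    ({"cat", "sleeping"}, "contradiction"),
    ({"cat", "running"},  "contradiction"),
    ({"cat", "happy"},    "entailment"),
    ({"dog", "happy"},    "entailment"),
    ({"dog", "running"},  "neutral"),
    ({"dog", "sleeping"}, "neutral"),
]

print(round(pmi(data, "cat", "contradiction"), 3))  # 1.0 bit
```

Whether 1.0 bit counts as “high” is precisely the threshold question this definition leaves open.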

Gardner et al. (2021) relaxed the last constraint, and assumed that every simple correlation between single-word features and output labels is spurious (henceforth every-word). They then defined a class of competent datasets, where the marginal probability for every feature is uniform over the class label, i.e., for any feature x and label y, p(y | x) = 1/|Y| (with Y the label set), thus limiting models from picking up any correlation between single features and output labels.
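
This uniformity condition can be checked mechanically; the sketch below is illustrative (a toy sentiment dataset, not Gardner et al.’s actual procedure):

```python
from collections import defaultdict

def is_balanced(examples, labels):
    """True iff for every feature w, p(y | w) = 1/|labels| for all labels y,
    i.e., no single feature carries information about the label."""
    counts = defaultdict(lambda: defaultdict(int))
    for feats, lab in examples:
        for w in feats:
            counts[w][lab] += 1
    uniform = 1 / len(labels)
    for by_label in counts.values():
        total = sum(by_label.values())
        if any(abs(by_label[lab] / total - uniform) > 1e-9 for lab in labels):
            return False
    return True

train = [
    ({"very", "good"}, "+"), ({"very", "bad"}, "-"),
    ({"not", "good"}, "-"), ({"not", "bad"}, "+"),
]
print(is_balanced(train, ["+", "-"]))  # True: every word splits 50/50
```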

We next extend the every-word approach beyond single words, showing that models that can exploit single word features can also exploit some feature interactions, and therefore these should also be considered spurious. Tab. 2 summarizes the different definitions of spurious correlations.

3.2 Balancing too Little Leaves some Spurious Features

Gardner et al. (2021) assumed that as each word can appear in certain contexts that change its semantic meaning (e.g., negation, sarcasm), each word is potentially spurious. Here we note that the same argument can be applied to feature interactions, such as word n-grams. We start with a toy example to illustrate our argument for bigrams, and then extend it for larger values of n.

Split   Text            Label
Train   very good       +
        very bad        −
        not good        −
        not bad         +
Test    not very good   −

Table 3: A toy example of a training set (Train), which is balanced for unigrams, but not for bigrams. Relying on the bigram correlations (e.g., memorizing that “very good” leads to a positive sentiment) will lead to mispredictions on the test set (Test).

Consider the toy dataset for the task of sentiment analysis shown in Tab. 3, with vocabulary V = {good, bad, not, very} and label set Y = {+, −}. The Train split is balanced with respect to single-word features, i.e., it is a balanced or competent dataset: for every word w ∈ V and label y ∈ Y, p(y | w) = 1/2. Assume the semantics of this dataset is that of English, where ‘+’ means positive sentiment and ‘−’ means negative.

A model trained on Train can achieve perfect training accuracy by learning the correct semantics. However, achieving perfect training accuracy can also be done by learning correlations between two-word features and the target label (i.e., memorizing all the training examples). In this case, the model would make the wrong prediction for the first test example in Test (as it has learned that very good is a feature that indicates positive sentiment), and similarly, will make a random prediction for the second test example, which does not contain any two-word feature seen during training.

This example highlights that balancing single-word features does not guarantee resiliency to spurious correlations, and therefore in order to mitigate all spurious correlations, balancing pairs of features is also required. One can construct similar examples for larger values of n, by similarly considering multi-word expressions and common co-occurrences (e.g., “jaw dropping”, “worst day ever”). These could serve as spurious correlations in the same way single words do.
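
A short sketch over the Tab. 3 toy data makes the gap concrete: every unigram is perfectly balanced, yet the bigram “very good” fully determines the label (the helper functions here are illustrative):

```python
from collections import defaultdict

def label_dist(examples, feature_fn):
    """p(label | feature) for every feature extracted by feature_fn."""
    counts = defaultdict(lambda: defaultdict(int))
    for tokens, lab in examples:
        for f in feature_fn(tokens):
            counts[f][lab] += 1
    return {f: {lab: c / sum(by.values()) for lab, c in by.items()}
            for f, by in counts.items()}

def unigrams(tokens):
    return tokens

def bigrams(tokens):
    return [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]

train = [
    (["very", "good"], "+"), (["very", "bad"], "-"),
    (["not", "good"], "-"), (["not", "bad"], "+"),
]

print(label_dist(train, unigrams)["good"])      # {'+': 0.5, '-': 0.5}
print(label_dist(train, bigrams)["very good"])  # {'+': 1.0}
```

A model that memorizes the second distribution achieves perfect training accuracy while mispredicting “not very good”.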

Another example is sarcasm. A model that fails to understand sarcastic contexts will misinterpret statements that appear in such contexts, even if it perfectly understands the base meaning of these statements. Thus, the entire reasoning process of such a model, whether relying on simple features, feature interactions, or other types of understanding, will result in mispredictions of certain inputs, and thus can be considered spurious.

As a result, to truly mitigate all spurious correlations in a dataset, balancing feature combinations is required as well. Accordingly, balancing too little will leave some spurious correlations in the dataset.

3.3 Too much Balancing Prevents Learning Valuable Semantic Knowledge

We observed that balancing too little does not allow models to fully eliminate spurious correlations. Here we show that too much balancing can prevent models from learning valuable knowledge.

Original Train Set          Augmented Samples
Input    Label              Input    Label
0 0      0                  *0 0     1
0 1      1                  *0 1     0
1 0      1                  *1 0     0
1 1      0                  *1 1     1

Table 4: Left: a training set for the XOR function, balanced for unigrams. Right: requiring that bigrams are also balanced would prevent models from learning.

Consider the training data for learning the XOR function presented in Tab. 4 (left). This dataset contains enough learnable signal when considering feature interactions despite being balanced for single words. Nonetheless, balancing this dataset for pairs of features would result in no information, and thus prevent any model from learning this function (Tab. 4, right).
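
The argument can be verified directly over the rows of Tab. 4; a minimal sketch:

```python
def pair_label_counts(examples):
    """Group the labels observed for each full input pair."""
    counts = {}
    for x, y in examples:
        counts.setdefault(x, []).append(y)
    return counts

xor_train = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

# Unigram-balanced, yet each input pair uniquely determines its label,
# so a model attending to feature pairs can learn XOR:
assert all(len(set(ys)) == 1 for ys in pair_label_counts(xor_train).values())

# "Balancing pairs" as in Tab. 4 (right): add each input again with the
# flipped label.
augmented = xor_train + [(x, 1 - y) for x, y in xor_train]

# Now every pair co-occurs equally with both labels -- no signal remains.
assert all(set(ys) == {0, 1} for ys in pair_label_counts(augmented).values())
print("pair-balanced XOR set carries no learnable signal")
```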

Now consider a given natural language dataset D. Define n to be the length of the longest document in D. By definition, balancing every combination of up to n features leaves no learnable signal in D.[5] We assume the standard data collection process when using AF, in which the last step is balancing Zellers et al. (2018); Dua et al. (2019), and longer instances cannot be added. We conclude that balancing too much can prevent models from learning semantic knowledge.

Combining the two observations, we are left with the question of the potential intersection between balancing too much and balancing too little: does a sweet spot exist for which no spurious correlations are found in the dataset, but enough learnable signal is left? And even if so, would a balancing algorithm, whether by augmentation or filtering, be able to find it? We leave these questions for future work, but note that addressing them is a prerequisite for the theoretical and practical application of dataset balancing for mitigating spurious correlations.

3.4 Dataset Balancing is Undesired

Even if a sweet spot exists between balancing too little and too much, do we really want to find it? Here we argue that perhaps not.

The practice of dataset balancing is designed to prevent models from learning that some words or expressions have a common fallback meaning, which can stem from dataset artifacts (e.g., “cat” as an indicator of contradiction) but also from cultural and historical contexts (e.g., Biden is the U.S. president in 2022). Fallback meanings are crucial for understanding language, as contexts are often underspecified Graesser (2013). Indeed, relying on fallback meanings might make models fail to process some inputs correctly, and might not generalize to other domains where the fallback meaning is different. Nonetheless, we argue that the ability to use them is central to language understanding.

For example, substantial efforts are made to teach models world knowledge, such as that the president of the U.S. is Joe Biden, the capital of Brazil is Brasília, and France is the soccer world champion. These efforts include building world knowledge datasets Wang et al. (2021), developing methods for enhancing models with this information Zhang et al. (2019); Peters et al. (2019), and evaluating how well models capture it Rubinstein et al. (2015); Roberts et al. (2020). But many of these world-knowledge facts are context dependent: the capital of Brazil changed in 1960, and the president of the U.S., as well as the soccer world champion, potentially changes every four years.

Another example is common sense knowledge, such as “people are happy when they receive a gift”, “an elephant is taller than a zebra”, and “a statue that doesn’t fit into a suitcase is too large”. A large body of work has been carried out to create benchmarks that measure the common sense abilities of models Liu and Singh (2004); Levesque et al. (2012); Zellers et al. (2018); Sakaguchi et al. (2020); Bisk et al. (2020), as well as augmenting models with such abilities Qin et al. (2020); Bosselut et al. (2021).

Common sense reasoning is, by definition, stochastic and reliant on understanding presupposed, underspecified context. One could imagine a person unhappy to receive a gift (e.g., because it is not what they wanted), a fantastically large zebra compared to a tiny elephant, and a suitcase with multiple compartments which prevent a small statue from fitting in it.

These examples illustrate that a model that learns these correlations and relies exclusively on them to make predictions is limited and is bound to make mistakes in some contexts. One way to avoid these mistakes is to balance these correlations out, and prevent models from knowing these assertions to begin with. We argue that this is not a desirable solution. In essence, an interpreter’s task (be it human or machine) is to infer the most probable context in which a statement is made, and as a result, it should have a fallback option for such world knowledge and common sense assertions.


We recognize that a balanced dataset may not be balanced with respect to the appearance of common-sense or world-knowledge assertions in a given context. E.g., a model might balance out the general fact that Joe Biden is the U.S. president, but not that he is the president in 2022. As in many cases much of the context is unobserved Graesser (2013), the question is whether we want models to make a prediction in cases of uncertainty based on the fallback option. We argue that doing so is a desired strategy in many cases (though a preferred strategy might be to interact or abstain from making a decisive prediction, see Sec. 4.2).

We also acknowledge that correlations in the real world can be misleading. For instance, people often mistake the biggest commercial city in some countries for their capital (e.g., Istanbul in Turkey), potentially due to the high correlation between the two. In such cases, relying on the fallback option might lead to prediction errors. However, we argue that following the human strategy of relying on a fallback option in cases of uncertainty will promote models’ communication abilities.[6] A counterexample is social biases, where we want to explicitly discourage models from having a fallback option (see Sec. 4.4 for discussion).

We want to stress that balancing methods can mitigate some spurious correlations, and therefore lead to increased generalization Le Bras et al. (2020); Swayamdipta et al. (2020). Moreover, the process of filtering the data naturally results in smaller datasets, which leads to lower training costs Swayamdipta et al. (2020). While such contributions are meaningful in terms of, e.g., environmental concerns (Strubell et al., 2019; Schwartz et al., 2020), they are orthogonal to our research question. Overall, despite the important contributions of balancing techniques, this paper shows that even the perfect balancing method might not mitigate all spurious correlations in a satisfying way.

So how can we make models more resilient to spurious correlations without balancing the data? Below we discuss several ideas for doing this.

4 Ways to Move Forward

So far, we have presented limitations of dataset balancing as a means to mitigate spurious correlations. In this section we discuss several alternatives to this practice, summarized in Tab. 5. We note that none of these proposals is particularly novel. Rather, we intend to survey alternatives proposed in the literature, argue that they may be promising for addressing the drawbacks of spurious correlations, and suggest that more effort should be put into studying them.

Current Practice         Proposal
Dataset balancing        Richer contexts (§4.1)
A closed label set       Abstain/interact (§4.2)
Large-scale fine-tuning  Few-shot learning (§4.3)

Table 5: Our suggestions for mitigating the effects of spurious correlations, listing three current practices, each with an alternative proposal.

4.1 Augmenting Datasets with Rich Contexts

The implicit assumption of dataset balancing is that in order to mitigate spurious correlations the model has to unlearn them, that is, they should be removed altogether from the training set. We argue that instead we should be focusing on learning and modeling richer contexts.

As an example, consider negation. A model that generalizes well should learn the meaning of words such as not, and should be able to negate new words, even those that were seen only in positive contexts at training time. For example, if a model only sees during training words like “amazing” or “happy” with positive sentiment, and thus learns that these words bear positive meaning, we would still expect it to interpret their negated appearance (e.g., not amazing) as an indicator of negative sentiment. Such generalization is crucial for language learning, and should ideally allow models to not rely exclusively on spurious correlations. Despite the immense progress in the field in the past decade, negation still poses a challenge to modern NLP tools Hossain et al. (2020, 2022).[7] Though we should continually assess the challenge negation poses to the most recent models Bowman (2022).

We suggest taking into account different types of contexts during dataset design; in particular, collecting training examples with contexts such as negation Morante and Blanco (2012), humor Weller and Seppi (2019); Annamoradnejad and Zoghi (2020), sarcasm Davidov et al. (2010); Oprea and Magdy (2020), or metaphors Tsvetkov et al. (2014); Mohammad et al. (2016). This recommendation applies both to supervised tasks and, perhaps more so, to pretraining data. We suggest adding documents with such contexts throughout the pretraining corpus, or as a continued pretraining step for existing large-scale models.[8] We recognize that editing pretraining corpora poses significant challenges due to their immense size, as demonstrated by recent efforts such as corpus analysis Dodge et al. (2021) and deduplication Lee et al. (2022).

To incorporate contexts from a wide range of phenomena, we can leverage the vast literature on broad-coverage semantics (Baker et al., 1998; Steedman and Baldridge, 2006; Banarescu et al., 2013; Abend and Rappoport, 2013).[9] See Abend and Rappoport (2017) for a survey. This line of work proposes theories of language, composing inventories of linguistic constructions with an algebraic formulation of their inter-relations in terms of truth value, factuality, and more. These inventories often include the phenomena discussed above, such as negation, sarcasm, and presupposition.

4.2 Interaction and Abstention to Cope with Underspecified Contexts

Figure 3: An example of abstention/interaction in cases of uncertainty. For the task of sentiment analysis, models currently assign a label to each input, even for ambiguous or underspecified ones (top). This may lead the model to over-rely on spurious correlations (marked in red, bottom left). Models that abstain or interact (bottom right) might learn to rely less on such correlations.

Most NLP tasks are designed with a closed label set that forces models to make a concrete prediction for each test instance, without an option to abstain or interact with the user to get more information. Even for tasks with a large label set (e.g., language modeling), models still have to output a valid vocabulary item. Here we argue that this practice creates an inductive bias towards using spurious correlations in cases of uncertainty: the model has “nothing to lose” when its certainty is low, and is encouraged to always make some prediction, potentially relying on spurious correlations.[10] We recognize that in some cases we do want the model to make a prediction under uncertainty (see Sec. 3.4). The ability to detect when it is reasonable to make an educated guess is an important property of an intelligent agent, and an exciting research question.

To further illustrate this point, consider the ambiguous sentence “To my great surprise, the movie turned out different than what I thought.”, in the context of sentiment analysis. The reader cannot infer whether the writer is pleasantly surprised (a positive review) or disappointed (a negative review). We argue that in such cases models might lean towards a positive sentiment based on the words “great” and “surprise”, which are typically correlated with a positive sentiment.

To test this, we ran a RoBERTa-large model Liu et al. (2019) fine-tuned on SST-2 Socher et al. (2013) on that example.[11] We used the AllenNLP demo (https://demo.allennlp.org/sentiment-analysis/). As expected, the model returns a positive label, with 99.99% confidence. Interestingly, three different interpretation methods (simple gradient visualization, Simonyan et al., 2014; integrated gradient visualization, Sundararajan et al., 2017; and SmoothGrad, Smilkov et al., 2017) all find the word “great” or “surprise” to be one of the three most influential words on the model’s prediction. While this example does not prove the prevalence of this problem, it does demonstrate its existence.

To address this problem, we suggest adopting approaches that allow models to abstain and interact when they cannot make a decision with high confidence Chow (1957); Hellman (1970); Laidlaw and Feizi (2019); Balcan et al. (2020). See Fig. 3. This can be achieved by building datasets with unanswerable questions Ray et al. (2016); Rajpurkar et al. (2018); Sulem et al. (2021), but also by designing models that abstain in cases of low certainty for all inputs, even those with an unambiguous gold label.[12] Model calibration techniques DeGroot and Fienberg (1983); Guo et al. (2017); Card and Smith (2018) are often used both for allowing models to abstain Cortes et al. (2016); Shrikumar et al. (2019) and for identifying unanswerable questions Kamath et al. (2020); Zhang et al. (2021). We hypothesize that encouraging the model to provide this output when it is unsure, rather than making a semi-educated guess, potentially based on spurious correlations, could reduce its dependency on such correlations.
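
As a minimal illustration of abstention, Chow’s (1957) rejection rule answers only when the top-class probability clears a threshold (the probabilities and threshold below are illustrative, not real model outputs):

```python
def predict_or_abstain(probs, threshold=0.9):
    """Chow-style rejection: return the argmax label only if its
    probability clears the threshold; otherwise abstain."""
    label = max(probs, key=probs.get)
    return label if probs[label] >= threshold else "ABSTAIN"

# Hypothetical sentiment-classifier outputs:
clear_cut = {"positive": 0.97, "negative": 0.03}  # "an amazing movie"
ambiguous = {"positive": 0.55, "negative": 0.45}  # "turned out different..."

print(predict_or_abstain(clear_cut))  # positive
print(predict_or_abstain(ambiguous))  # ABSTAIN
```

Note that such a rule is only as good as the model’s calibration, which is why the calibration techniques cited above are usually applied first.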

4.3 The End of Large-Scale Fine-Tuning?

This paper has demonstrated the limitations of mitigating spurious correlations via dataset balancing. A naive way to mitigate spurious correlations is to stop using large-scale datasets altogether. We echo recent calls (Liu et al., 2021b) and argue that recent advances in zero- and few-shot learning might make this option viable, replacing supervised learning (i.e., large-scale fine-tuning) for many tasks.

Large pretrained models such as T5 (Raffel et al., 2020) or GPT-3 (Brown et al., 2020), trained on vast amounts of data, arguably learn enough about the world to acquire many of the skills currently learned through supervised training. Indeed, the large increase in the size and capacity of pretrained language models has led to a new wave of few-shot and zero-shot methods (Schick and Schütze, 2021a; Shin et al., 2020; Gu et al., 2022), which are able to reach human-level performance on certain tasks using only a few dozen training examples (Schick and Schütze, 2021b). Given these impressive results, it is not clear whether there is still value in fine-tuning models on large-scale datasets for all tasks. In the context of this work, focusing on few-shot learning might prevent models from learning some of the correlations that result from manual annotation (Schwartz et al., 2017; Gururangan et al., 2018; Poliak et al., 2018), as they would not be exposed to many of them to begin with.
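In-context learning makes this concrete: the labeled examples live in the prompt rather than in a training set, so the model's parameters never see them. The sketch below builds such a prompt for the sentiment example above; the "Review:/Sentiment:" format, the helper name, and the example reviews are illustrative assumptions, not a prescribed recipe:

```python
def make_few_shot_prompt(task_description, demonstrations, query):
    """Assemble an in-context learning prompt in the style used with
    GPT-3-like models (Brown et al., 2020): a task description, a handful
    of labeled demonstrations, and the unlabeled query. The model only
    conditions on this string; its parameters are never updated, so it
    cannot absorb artifacts from a large annotated training set."""
    lines = [task_description, ""]
    for text, label in demonstrations:
        lines.append(f"Review: {text}\nSentiment: {label}\n")
    # The query ends with an open "Sentiment:" slot for the model to fill.
    lines.append(f"Review: {query}\nSentiment:")
    return "\n".join(lines)

prompt = make_few_shot_prompt(
    "Classify each movie review as positive or negative.",
    [("A moving, beautifully acted film.", "positive"),
     ("Two hours I will never get back.", "negative")],
    "To my great surprise, the movie turned out different than what I thought.",
)
```

With only two demonstrations in the context, there is simply much less labeled data from which dataset-level artifacts could be learned.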

We note that this proposal is not a perfect solution. First, some spurious correlations may be picked up from the small number of training examples. This is less of a problem in the zero-shot setting, or in few-shot settings where the model parameters are not updated (Brown et al., 2020), but studying the extent to which spurious correlations are picked up in other few-shot settings is an important avenue for future research. Second, some spurious correlations might be picked up during the pretraining stage (Gehman et al., 2020; Birhane et al., 2021; Dodge et al., 2021). Continuing to quantify this phenomenon and finding ways to mitigate it is another important line of research.

An important question in this context is which tasks still require supervised learning. It seems plausible that excelling at language modeling requires mastering the skills that underlie many NLP tasks, such as sentiment analysis, syntactic parsing, and NER. However, it is similarly plausible that this is not the case for other tasks, e.g., summarization, simplification, and dialogue. We are cautious about making concrete recommendations for which tasks to apply this principle to, but suggest the following intuitive rule of thumb: for datasets or tasks for which the state of the art is close to or surpasses the human baseline, we should consider moving to few-shot setups.

Finally, dataset creation is still a valuable and important line of research. Our recommendation to stop building large-scale training sets does not make this effort redundant: datasets are still needed both to spur the design of better models and to better test their capabilities. We suggest that instead of building large training sets and small validation and test sets, authors consider building large test sets, as a means of achieving improved statistical power (Card et al., 2020).
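The statistical-power argument can be made quantitative with a back-of-the-envelope calculation. The sketch below uses a standard two-proportion normal approximation, an unpaired simplification of the setting analyzed by Card et al. (2020), to estimate the probability of detecting a given accuracy difference at a given test-set size; the specific accuracies and sizes are illustrative:

```python
from math import erf, sqrt

def normal_cdf(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def detection_power(acc_a, acc_b, n):
    """Approximate power of a two-sided test (significance level 0.05) for
    the accuracy difference between two models evaluated on a shared test
    set of n examples, via the two-proportion normal approximation. This is
    an unpaired simplification of the analysis in Card et al. (2020)."""
    z_crit = 1.96                                  # two-sided critical value
    p_bar = (acc_a + acc_b) / 2.0
    se = sqrt(2.0 * p_bar * (1.0 - p_bar) / n)     # std. error of the difference
    return normal_cdf(abs(acc_a - acc_b) / se - z_crit)

# A 1-point difference (90% vs. 91%) is nearly undetectable with a
# 1,000-example test set, but detected almost surely with 100,000.
low = detection_power(0.90, 0.91, 1_000)      # ~0.12
high = detection_power(0.90, 0.91, 100_000)   # ~1.00
```

Under these assumptions, a typical few-thousand-example test set is severely underpowered for the single-point accuracy gaps that often separate leaderboard entries, which is precisely the motivation for shifting annotation effort from training sets to large test sets.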

4.4 A Note on Social-Bias Correlations

So far, we have discussed the problems with unlearning spurious correlations, and advocated instead for more elaborate context modeling. One exception might be the case of social biases. Textual data often reflects human stereotypes, such as spurious correlations between labels and protected group attributes, e.g., alignments between professions and gender or race. Unlike the other types of knowledge discussed in Sec. 3.4, here there is an incentive to prevent models from learning such correlations, as a means of actively reducing the harms of these biases, especially in commercial and public-facing applications such as machine translation (Stanovsky et al., 2019) or automated financial decision-making (Bartlett et al., 2021). For such spurious correlations, dataset balancing therefore remains a desirable goal.

Nonetheless, as demonstrated in Sec. 3, dataset balancing is a limited solution for mitigating spurious correlations, including social-bias ones. In contrast, the methods proposed in this section might also assist in mitigating social biases, or at least slow their amplification (Zhao et al., 2017).

5 Related Work

This paper discusses the arms race between models and datasets. Previous work criticized one side of this arms race, the increasing size of pretrained models, due to ethical and environmental concerns (Schwartz et al., 2020; Bender et al., 2021), or questioned its ability to learn meaningful abstractions from raw text (Bender and Koller, 2020; Merrill et al., 2021). This work studies the other side of the arms race: the efforts to mitigate spurious correlations through dataset balancing. The release of such datasets is often motivated by their potential to spur progress in modeling, and to help tease apart qualitative differences between models. Liu et al. (2021a) showed that this is not necessarily the case, observing that the ranking of reading comprehension models on small and synthetic benchmarks is similar to their ranking on the (large) SQuAD dataset (Rajpurkar et al., 2016).

Raji et al. (2021) recently criticized the concept of benchmarks as a whole, arguing that they capture only specific skills rather than “general” capabilities. Our paper raises related concerns about training sets implicitly containing spurious correlations, and suggests reconsidering the practice of building large-scale training sets.

Finally, concurrently with this work, Eisenstein (2022) discussed several types of spurious correlations in the context of causality theory (Pearl, 2009), using a toy example to demonstrate their different effects on models. They concluded that domain knowledge is required to identify the correlations that are truly spurious, i.e., those that might challenge the generalization ability of models.

6 Conclusion

Spurious correlations in large textual corpora can result in model brittleness, lack of generalization, and an inflated sense of the state of the art. Mitigating their negative side effects is an important research goal of the NLP community. In this paper we presented practical and conceptual limitations of dataset balancing as a means of doing so. We proposed alternative ways of mitigating spurious correlations, including adding richer contexts to textual corpora and allowing models to abstain or interact in cases of uncertainty. We concluded by suggesting that the community reconsider the practice of fine-tuning pretrained models on large-scale training sets.

7 Broader Impact and Ethical Consideration

Our work did not involve any new data or annotation collection, and as such did not require crowdsourced or in-house workers, nor did it introduce any new models and their associated risks. Instead, we examined existing resources and common dataset balancing approaches. In Section 4.4 we specifically discussed the relation between these practices and their implications for social bias in models.


Acknowledgments

We would like to thank Matt Gardner and Will Merrill for the in-depth discussion. We would also like to thank Omri Abend, Yoav Goldberg, Inbal Magar, and the anonymous reviewers for their feedback. This work was supported in part by research gifts from the Allen Institute for AI.


  • O. Abend and A. Rappoport (2013) Universal Conceptual Cognitive Annotation (UCCA). In Proc. of ACL, External Links: Link Cited by: §4.1.
  • O. Abend and A. Rappoport (2017) The state of the art in semantic representation. In Proc. of ACL, External Links: Link, Document Cited by: footnote 9.
  • I. Annamoradnejad and G. Zoghi (2020) ColBERT: using bert sentence embedding for humor detection. Note: arXiv:2004.12765 External Links: Link Cited by: §4.1.
  • S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh (2015) VQA: visual question answering. In Proc. of ICCV, External Links: Document, Link Cited by: §2.
  • C. F. Baker, C. J. Fillmore, and J. B. Lowe (1998) The Berkeley FrameNet project. In Proc. of ACL, External Links: Link, Document Cited by: §4.1.
  • M. Balcan, A. Blum, D. Sharma, and H. Zhang (2020) On the power of abstention and data-driven decision making for adversarial robustness. Note: arXiv:2010.06154 Cited by: §4.2.
  • L. Banarescu, C. Bonial, S. Cai, M. Georgescu, K. Griffitt, U. Hermjakob, K. Knight, P. Koehn, M. Palmer, and N. Schneider (2013) Abstract Meaning Representation for sembanking. In Proc. of LAW VII & ID, External Links: Link Cited by: §4.1.
  • R. Bartlett, A. Morse, R. Stanton, and N. Wallace (2021) Consumer-lending discrimination in the fintech era. Journal of Financial Economics. Cited by: §4.4.
  • E. M. Bender, T. Gebru, A. McMillan-Major, and S. Shmitchell (2021) On the dangers of stochastic parrots: can language models be too big?. In Proc. of FAccT, External Links: Document, ISBN 9781450383097, Link Cited by: §5.
  • E. M. Bender and A. Koller (2020) Climbing towards NLU: On meaning, form, and understanding in the age of data. In Proc. of ACL, External Links: Document, Link Cited by: §5.
  • C. Bhagavatula, R. L. Bras, C. Malaviya, K. Sakaguchi, A. Holtzman, H. Rashkin, D. Downey, S. W. Yih, and Y. Choi (2020) Abductive commonsense reasoning. In Proc. of ICLR, External Links: Document, Link Cited by: §2.
  • A. Birhane, V. U. Prabhu, and E. Kahembwe (2021) Multimodal datasets: misogyny, pornography, and malignant stereotypes. Note: arXiv:2110.01963 External Links: Link Cited by: §4.3.
  • Y. Bisk, R. Zellers, R. Le Bras, J. Gao, and Y. Choi (2020) PIQA: reasoning about physical commonsense in natural language. In Proc. of AAAI, External Links: Link Cited by: §3.4.
  • A. Bosselut, R. Le Bras, and Y. Choi (2021) Dynamic neuro-symbolic knowledge graph construction for zero-shot commonsense question answering. In Proc. of AAAI, External Links: Link Cited by: §3.4.
  • S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning (2015) A large annotated corpus for learning natural language inference. In Proc. of EMNLP, External Links: Document, Link Cited by: §3.1.
  • S. R. Bowman (2022) The dangers of underclaiming: reasons for caution when reporting how nlp systems fail. In Proc. of ACL, External Links: Link Cited by: footnote 7.
  • T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei (2020) Language models are few-shot learners. In Proc. of NeurIPS, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.), External Links: Link Cited by: §4.3, §4.3.
  • D. Card, P. Henderson, U. Khandelwal, R. Jia, K. Mahowald, and D. Jurafsky (2020) With little power comes great responsibility. In Proc. of EMNLP, External Links: Document, Link Cited by: §4.3.
  • D. Card and N. A. Smith (2018) The importance of calibration for estimating proportions from annotations. In Proc. of NAACL, External Links: Link, Document Cited by: footnote 12.
  • K. Chang, H. He, R. Jia, and S. Singh (2021) Robustness and adversarial examples in natural language processing. In Proc. of EMNLP: Tutorial Abstracts, External Links: Link Cited by: §3.1.
  • K. Chen, W. Xu, X. Cheng, Z. Xiaochuan, Y. Zhang, L. Song, T. Wang, Y. Qi, and W. Chu (2020) Question directed graph attention network for numerical reasoning over text. In Proc. of EMNLP, External Links: Document, Link Cited by: §2.
  • C. Chow (1957) An optimum character recognition system using decision functions. IRE Transactions on Electronic Computers 6, pp. 247–254. Cited by: §4.2.
  • C. Cortes, G. DeSalvo, and M. Mohri (2016) Boosting with abstention. In Proc. of NeurIPS, D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett (Eds.), External Links: Link Cited by: footnote 12.
  • D. Davidov, O. Tsur, and A. Rappoport (2010) Semi-supervised recognition of sarcasm in Twitter and Amazon. In Proc. of CoNLL, External Links: Link Cited by: §4.1.
  • M. H. DeGroot and S. E. Fienberg (1983) The comparison and evaluation of forecasters. Journal of the Royal Statistical Society: Series D (The Statistician) 32 (1-2), pp. 12–22. Cited by: footnote 12.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. of NAACL-HLT, External Links: Document, Link Cited by: §1, §2, §2.
  • J. Dodge, M. Sap, A. Marasović, W. Agnew, G. Ilharco, D. Groeneveld, M. Mitchell, and M. Gardner (2021) Documenting large webtext corpora: a case study on the colossal clean crawled corpus. In Proc. of EMNLP, External Links: Link Cited by: §4.3, footnote 8.
  • D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner (2019) DROP: a reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proc. of NAACL-HLT, External Links: Document, Link Cited by: §2, footnote 5.
  • J. Eisenstein (2022) Uninformative input features and counterfactual invariance: two perspectives on spurious correlations in natural language. In Proc. of NAACL, External Links: Document, Link Cited by: §5, footnote 4.
  • Y. Elazar, H. Zhang, Y. Goldberg, and D. Roth (2021) Back to square one: artifact detection, training and commonsense disentanglement in the Winograd schema. In Proc. of EMNLP, External Links: Link Cited by: §2.
  • M. Gardner, W. Merrill, J. Dodge, M. E. Peters, A. Ross, S. Singh, and N. A. Smith (2021) Competency problems: on finding and removing artifacts in language data. In Proc. of EMNLP, External Links: Link Cited by: On the Limitations of Dataset Balancing: The Lost Battle Against Spurious Correlations, §1, §3.1, §3.2.
  • S. Gehman, S. Gururangan, M. Sap, Y. Choi, and N. A. Smith (2020) RealToxicityPrompts: evaluating neural toxic degeneration in language models. In Findings of EMNLP, External Links: Document, Link Cited by: §4.3.
  • M. Glockner, V. Shwartz, and Y. Goldberg (2018) Breaking NLI systems with sentences that require simple lexical inferences. In Proc. of ACL, External Links: Document, Link Cited by: §2.
  • Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh (2017) Making the V in VQA matter: elevating the role of image understanding in visual question answering. In Proc. of CVPR, External Links: Document, Link Cited by: §1, Figure 2, §2.
  • A. C. Graesser (2013) Prose comprehension beyond the word. Cited by: §1, §3.4, §3.4.
  • H. P. Grice (1975) Logic and conversation. In Speech acts, Cited by: §1.
  • P. Grice (1989) Studies in the way of words. Cited by: §1.
  • Y. Gu, X. Han, Z. Liu, and M. Huang (2022) PPT: pre-trained prompt tuning for few-shot learning. In Proc. of ACL, Cited by: §4.3.
  • C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger (2017) On calibration of modern neural networks. In Proc. of ICML, Cited by: footnote 12.
  • S. Gururangan, S. Swayamdipta, O. Levy, R. Schwartz, S. Bowman, and N. A. Smith (2018) Annotation artifacts in natural language inference data. In Proc. of NAACL-HLT, External Links: Document, Link Cited by: §1, §2, §3.1, §3.1, §4.3.
  • P. He, J. Gao, and W. Chen (2021a) DeBERTaV3: improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing. Note: arXiv:2111.09543 External Links: Document, Link Cited by: §2.
  • P. He, X. Liu, J. Gao, and W. Chen (2021b) DeBERTa: decoding-enhanced BERT with disentangled attention. In Proc. of ICLR, External Links: Document, Link Cited by: §2.
  • M. E. Hellman (1970) The nearest neighbor classification rule with a reject option. IEEE Transactions on Systems Science and Cybernetics 6, pp. 179–185. Cited by: §4.2.
  • M. M. Hossain, A. Anastasopoulos, E. Blanco, and A. Palmer (2020) It’s not a non-issue: negation as a source of error in machine translation. In Findings of EMNLP, External Links: Document, Link Cited by: §4.1.
  • M. M. Hossain, D. Chinnappa, and E. Blanco (2022) An analysis of negation in natural language understanding corpora. In Proc. of ACL, External Links: Document, Link Cited by: §4.1.
  • D. A. Hudson and C. D. Manning (2019) GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proc. of CVPR, External Links: Document, Link Cited by: §1, §2.
  • A. Kamath, R. Jia, and P. Liang (2020) Selective question answering under domain shift. In Proc. of ACL, External Links: Link, Document Cited by: footnote 12.
  • C. Laidlaw and S. Feizi (2019) Playing it safe: adversarial robustness with an abstain option. Note: arXiv:1911.11253 External Links: Link Cited by: §4.2.
  • R. Le Bras, S. Swayamdipta, C. Bhagavatula, R. Zellers, M. E. Peters, A. Sabharwal, and Y. Choi (2020) Adversarial filters of dataset biases. In Proc. of ICML, External Links: Link Cited by: §3.4, footnote 1, footnote 2.
  • K. Lee, D. Ippolito, A. Nystrom, C. Zhang, D. Eck, C. Callison-Burch, and N. Carlini (2022) Deduplicating training data makes language models better. In Proc. of ACL, Cited by: footnote 8.
  • H. Levesque, E. Davis, and L. Morgenstern (2012) The winograd schema challenge. In Proc. of KR, Cited by: §3.4.
  • X. L. Li, A. Kuncoro, C. d. M. d’Autume, P. Blunsom, and A. Nematzadeh (2021) Do language models learn commonsense knowledge?. Note: arXiv:2111.00607 External Links: Document, Link Cited by: §2, §2.
  • H. Liu and P. Singh (2004) ConceptNet: a practical commonsense reasoning toolkit. BT Technology Journal 22 (4). External Links: Link Cited by: §3.4.
  • N. F. Liu, T. Lee, R. Jia, and P. Liang (2021a) Can small and synthetic benchmarks drive modeling innovation? a retrospective study of question answering modeling approaches. Note: arXiv:2102.01065 External Links: Link Cited by: §5.
  • P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig (2021b) Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing. Note: arXiv:2107.13586 External Links: Link Cited by: §1, §4.3.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized bert pretraining approach. Note: arXiv:1907.11692 External Links: Link Cited by: §1, §2, §4.2.
  • W. Merrill, Y. Goldberg, R. Schwartz, and N. A. Smith (2021) Provable limitations of acquiring meaning from ungrounded form: what will future language models understand?. TACL. External Links: Link Cited by: §5.
  • S. Mohammad, E. Shutova, and P. Turney (2016) Metaphor as a medium for emotion: an empirical study. In Proc. of *SEM, External Links: Document, Link Cited by: §4.1.
  • R. Morante and E. Blanco (2012) *SEM 2012 shared task: resolving the scope and focus of negation. In Proc. of *SEM, External Links: Link Cited by: §4.1.
  • S. Oprea and W. Magdy (2020) ISarcasm: a dataset of intended sarcasm. In Proc. of ACL, External Links: Document, Link Cited by: §4.1.
  • J. Pearl (2009) Causality. Cambridge university press. Cited by: §5.
  • M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In Proc. of NAACL-HLT, External Links: Document, Link Cited by: §2.
  • M. E. Peters, M. Neumann, R. Logan, R. Schwartz, V. Joshi, S. Singh, and N. A. Smith (2019) Knowledge enhanced contextual word representations. In Proc. of EMNLP, External Links: Document, Link Cited by: §3.4.
  • A. Poliak, J. Naradowsky, A. Haldar, R. Rudinger, and B. Van Durme (2018) Hypothesis only baselines in natural language inference. In Proc. of *SEM, External Links: Document, Link Cited by: §4.3.
  • L. Qin, V. Shwartz, P. West, C. Bhagavatula, J. D. Hwang, R. Le Bras, A. Bosselut, and Y. Choi (2020) Back to the future: unsupervised backprop-based decoding for counterfactual and abductive commonsense reasoning. In Proc. of EMNLP, External Links: Document, Link Cited by: §3.4.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. (2019) Language models are unsupervised multitask learners. OpenAI blog 1 (8). Cited by: §2.
  • C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR 21 (140), pp. 1–67. External Links: Link Cited by: §1, §2, §2, §4.3.
  • I. D. Raji, E. M. Bender, A. Paullada, E. Denton, and A. Hanna (2021) AI and the everything in the whole wide world benchmark. In Proc. Of NeurIPS Benchmarks and Datasets track, External Links: Link Cited by: §5.
  • P. Rajpurkar, R. Jia, and P. Liang (2018) Know what you don’t know: unanswerable questions for SQuAD. In Proc. of ACL, External Links: Document, Link Cited by: §4.2.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ questions for machine comprehension of text. In Proc. of EMNLP, External Links: Document, Link Cited by: §5.
  • A. Ray, G. Christie, M. Bansal, D. Batra, and D. Parikh (2016) Question relevance in VQA: identifying non-visual and false-premise questions. In Proc. of EMNLP, External Links: Document, Link Cited by: §4.2.
  • A. Roberts, C. Raffel, and N. Shazeer (2020) How much knowledge can you pack into the parameters of a language model?. In Proc. of EMNLP, External Links: Document, Link Cited by: §3.4.
  • A. Rogers (2021) Changing the world by changing the data. In Proc. of ACL, External Links: Link, Document Cited by: §3.1.
  • D. Rubinstein, E. Levi, R. Schwartz, and A. Rappoport (2015) How well do distributional models capture different types of semantic knowledge?. In Proc. of ACL, External Links: Document, Link Cited by: §3.4.
  • K. Sakaguchi, R. Le Bras, C. Bhagavatula, and Y. Choi (2020) WinoGrande: an adversarial winograd schema challenge at scale. In Proc. of AAAI, External Links: Link Cited by: On the Limitations of Dataset Balancing: The Lost Battle Against Spurious Correlations, §1, §2, §3.4.
  • T. Schick and H. Schütze (2021a) It’s not just size that matters: small language models are also few-shot learners. In Proc. of NAACL, External Links: Document, Link Cited by: §4.3.
  • T. Schick and H. Schütze (2021b) True few-shot learning with prompts – a real-world perspective. Note: arXiv:2111.13440 External Links: Link Cited by: §4.3.
  • R. Schwartz, J. Dodge, N. A. Smith, and O. Etzioni (2020) Green AI. CACM 63 (12). External Links: Document, ISSN 0001-0782, Link Cited by: §3.4, §5.
  • R. Schwartz, M. Sap, I. Konstas, L. Zilles, Y. Choi, and N. A. Smith (2017) The effect of different writing tasks on linguistic style: a case study of the ROC story cloze task. In Proc. of CoNLL, External Links: Document, Link Cited by: §1, §4.3.
  • R. Sharma, J. Allen, O. Bakhshandeh, and N. Mostafazadeh (2018) Tackling the story ending biases in the story cloze test. In Proc. of ACL, External Links: Link, Document Cited by: §2.
  • T. Shin, Y. Razeghi, R. L. Logan IV, E. Wallace, and S. Singh (2020) AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. In Proc. of EMNLP, External Links: Document, Link Cited by: §4.3.
  • A. Shrikumar, A. Alexandari, and A. Kundaje (2019) A flexible and adaptive framework for abstention under class imbalance. Note: arXiv:1802.07024 External Links: Link Cited by: footnote 12.
  • K. Simonyan, A. Vedaldi, and A. Zisserman (2014) Deep inside convolutional networks: visualising image classification models and saliency maps. Note: arXiv:1312.6034 External Links: 1312.6034, Link Cited by: §4.2.
  • D. Smilkov, N. Thorat, B. Kim, F. Viégas, and M. Wattenberg (2017) SmoothGrad: removing noise by adding noise. Note: arXiv:1706.03825 External Links: Link Cited by: §4.2.
  • R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts (2013) Recursive deep models for semantic compositionality over a sentiment treebank. In Proc. of EMNLP, External Links: Link Cited by: §4.2.
  • G. Stanovsky, N. A. Smith, and L. Zettlemoyer (2019) Evaluating gender bias in machine translation. In Proc. of ACL, External Links: Document, Link Cited by: §1, §4.4.
  • M. Steedman and J. Baldridge (2006) Combinatory categorial grammar. In Encyclopedia of Language & Linguistics (Second Edition), K. Brown (Ed.), pp. 610–621. External Links: ISBN 978-0-08-044854-1, Document, Link Cited by: §4.1.
  • E. Strubell, A. Ganesh, and A. McCallum (2019) Energy and policy considerations for deep learning in NLP. In Proc. of ACL, External Links: Document, Link Cited by: §3.4.
  • E. Sulem, J. Hay, and D. Roth (2021) Do we know what we don’t know? studying unanswerable questions beyond SQuAD 2.0. In Findings of EMNLP, External Links: Link Cited by: §4.2.
  • M. Sundararajan, A. Taly, and Q. Yan (2017) Axiomatic attribution for deep networks. In Proc. of ICML, External Links: Link Cited by: §4.2.
  • S. Swayamdipta, R. Schwartz, N. Lourie, Y. Wang, H. Hajishirzi, N. A. Smith, and Y. Choi (2020) Dataset cartography: mapping and diagnosing datasets with training dynamics. In Proc. of EMNLP, External Links: Document, Link Cited by: §3.4, footnote 1.
  • N. Tandon, K. Sakaguchi, B. Dalvi, D. Rajagopal, P. Clark, M. Guerquin, K. Richardson, and E. Hovy (2020) A dataset for tracking entities in open domain procedural text. In Proc. of EMNLP, External Links: Document, Link Cited by: §1.
  • Y. Tsvetkov, L. Boytsov, A. Gershman, E. Nyberg, and C. Dyer (2014) Metaphor detection with cross-lingual model transfer. In Proc. of ACL, External Links: Document, Link Cited by: §4.1.
  • L. Wang, Y. Li, O. Aslan, and O. Vinyals (2021) WikiGraphs: a Wikipedia text - knowledge graph paired dataset. In Proc. of TextGraphs, External Links: Link Cited by: §3.4.
  • Z. Wang and A. Culotta (2020) Identifying spurious correlations for robust text classification. In Findings of EMNLP, External Links: Link, Document Cited by: §3.1.
  • O. Weller and K. Seppi (2019) Humor detection: a transformer gets the last laugh. In Proc. of EMNLP, External Links: Document, Link Cited by: §4.1.
  • Y. Yaghoobzadeh, S. Mehri, R. Tachet des Combes, T. J. Hazen, and A. Sordoni (2021) Increasing robustness to spurious correlations using forgettable examples. In Proc. of EACL, External Links: Link, Document Cited by: §3.1.
  • R. Zellers, Y. Bisk, R. Schwartz, and Y. Choi (2018) SWAG: a large-scale adversarial dataset for grounded commonsense inference. In Proc. of EMNLP, External Links: Document, Link Cited by: §1, §2, §3.4, footnote 5.
  • R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi (2019) HellaSwag: can a machine really finish your sentence?. In Proc. of ACL, External Links: Document, Link Cited by: §2.
  • S. Zhang, X. Liu, J. Liu, J. Gao, K. Duh, and B. Van Durme (2018) ReCoRD: bridging the gap between human and machine commonsense reading comprehension. Note: arXiv:1810.12885 External Links: Document, Link Cited by: §2.
  • S. Zhang, C. Gong, and E. Choi (2021) Knowing more about questions can help: improving calibration in question answering. In Findings of ACL, External Links: Link, Document Cited by: footnote 12.
  • Z. Zhang, X. Han, Z. Liu, X. Jiang, M. Sun, and Q. Liu (2019) ERNIE: enhanced language representation with informative entities. In Proc. of ACL, External Links: Link, Document Cited by: §3.4.
  • J. Zhao, T. Wang, M. Yatskar, V. Ordonez, and K. Chang (2017) Men also like shopping: reducing gender bias amplification using corpus-level constraints. In Proc. of EMNLP, External Links: Document, Link Cited by: §1, §4.4.