Mitigating Annotation Artifacts in Natural Language Inference Datasets to Improve Cross-dataset Generalization Ability

by   Guanhua Zhang, et al.

Natural language inference (NLI) aims at predicting the relationship between a given premise-hypothesis pair. However, several works have found that a bias pattern called annotation artifacts is widespread in NLI datasets, making it possible to identify the label by looking only at the hypothesis. This irregularity makes evaluation results over-estimated and hurts models' generalization ability. In this paper, we consider a more trustworthy setting, i.e., cross-dataset evaluation, and explore the impacts of annotation artifacts in cross-dataset testing. Furthermore, we propose a training framework to mitigate the impacts of the bias pattern. Experimental results demonstrate that our methods can alleviate the negative effect of the artifacts and improve the generalization ability of models.



1 Introduction

Natural language inference (NLI) is a widely-studied problem in natural language processing. It aims at comparing a pair of sentences (i.e., a premise and a hypothesis) and inferring the relationship between them (i.e., entailment, neutral and contradiction). Large-scale datasets like SNLI (Bowman et al., 2015) and MultiNLI (Williams et al., 2018) have been created by crowd-sourcing and have substantially advanced NLI research.

However, several works (Gururangan et al., 2018; Tsuchiya, 2018; Wang et al., 2018) have pointed out that crowd-sourcing workers have introduced a bias pattern named annotation artifacts into these NLI datasets. Such artifacts in hypotheses can reveal the labels and make it possible to predict the labels solely from the hypotheses. For example, a model trained on SNLI with only the hypotheses achieves an accuracy of 67.0%, whereas always predicting the majority class yields only 34.3% (Gururangan et al., 2018).

Classifiers trained on NLI datasets are supposed to make predictions by understanding the semantic relationships between given sentence pairs. However, it has been shown that models unintentionally utilize the annotation artifacts (Wang et al., 2018; Gururangan et al., 2018). If the evaluation is conducted under a distribution similar to the training data, e.g., with the given testing set, models will enjoy additional advantages, making the evaluation results over-estimated. On the other hand, if the bias pattern cannot be generalized to the real world, it may introduce noise to models, thus hurting their generalization ability.

In this paper, we use cross-dataset testing to better assess models’ generalization ability. We investigate the impacts of annotation artifacts in cross-dataset testing. Furthermore, we propose an easy-to-adopt debiasing training framework, which does not require any additional data or annotations, and apply it to the high-performing Densely Interactive Inference Network (DIIN; Gong et al., 2017). Experiments show that our method can effectively mitigate the bias pattern and improve the cross-dataset generalization ability of models. To the best of our knowledge, our work is the first attempt to alleviate the annotation artifacts without any extra resources.

2 Related Work

Frequently-used NLI datasets such as SNLI and MultiNLI are created by crowd-sourcing (Bowman et al., 2015; Williams et al., 2018): workers are presented with a premise and asked to produce three hypotheses corresponding to the three labels. As Gururangan et al. (2018) pointed out, workers may adopt specific annotation strategies and heuristics when authoring hypotheses to save effort, which produces certain patterns, called annotation artifacts, in the data. Models trained on such datasets are heavily affected by the bias pattern (Gururangan et al., 2018).

Wang et al. (2018) further investigate models’ robustness to the bias pattern using swapping operations. Poliak et al. (2018) demonstrate that annotation artifacts exist widely among NLI datasets: they show that a hypothesis-only model, i.e., a model trained and evaluated with only the hypotheses, outperforms always predicting the majority class on six of ten NLI datasets.

The emergence of the pattern can be attributed to selection bias (Rosenbaum and Rubin, 1983; Zadrozny, 2004; d’Agostino, 1998) in the dataset preparation procedure. Several works (Levy and Dagan, 2016; Rudinger et al., 2017) investigate the bias problem in relation inference datasets. Zhang et al. (2019) investigate the selection bias embodied in the comparing relationships in six natural language sentence matching datasets and propose a debiasing training and evaluation framework.

3 Making Artifacts Unpredictable

Essentially, the problem with the bias pattern is that the artifacts in hypotheses are distributed differently among labels, so balancing them across labels may be a good way to alleviate the impacts (Gururangan et al., 2018).

Based on the idea proposed by Zhang et al. (2019), we demonstrate that we can make artifacts in biased datasets balanced across classes by assigning a specific weight to every sample. We refer to the distribution of the resulting weighted dataset as the artifact-balanced distribution. We consider a supervised NLI task, which is to predict the relationship label $y$ given a sentence pair $x$, and we denote the hypothesis in $x$ as $h$. Without loss of generality, we assume that the prior probabilities of the different labels are equal, and then we have the following theorem.

Theorem 1.

For any classifier $f$ and any loss function $\ell$, if we use $\frac{1}{P(y \mid h)}$ as the weight for every sample $(x, y)$ during training, where $P(y \mid h)$ is measured on the biased dataset, it is equivalent to training with the artifact-balanced distribution.

Detailed assumptions and the proof of the theorem are presented in Appendix A. With the theorem, we can simply use cross prediction to estimate $P(y \mid h)$ in the original datasets and use $\frac{1}{P(y \mid h)}$ as sample weights during training. The step-by-step procedure for artifact-balanced learning is presented in Algorithm 1.

However, it is difficult to estimate the probability $P(y \mid h)$ precisely, and a minor estimation error can change the weight significantly, especially when the probability is close to zero. Thus, in practice, we use $\frac{1}{P(y \mid h) + \alpha}$ as sample weights during training, where $\alpha$ is a smooth term, in order to improve robustness. As $\alpha$ increases, the weights tend toward uniform, indicating that the debiasing effect decreases as the smooth term grows. Moreover, in order to keep the prior probabilities unchanged, we normalize the weights so that their sums over the three labels are the same.

Algorithm 1: Artifact-balanced Learning
Input: The dataset $D$ and the number of folds $k$ for cross prediction.
01 Estimate $P(y \mid h)$ for every sample by training hypothesis-only classifiers with a $k$-fold cross-prediction strategy.
02 Obtain the weight $w = \frac{1}{P(y \mid h) + \alpha}$ for every sample and normalize the sums of the weights across the three labels.
03 Train and validate models using $w$ as the sample weights.
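Steps 02-03 above are mechanical once $P(y \mid h)$ has been estimated. Below is a minimal Python sketch of the weight computation; the function name and interface are our own illustration, not the paper's released code:

```python
from collections import defaultdict

def artifact_balanced_weights(label_probs, labels, alpha=0.01):
    """Smoothed sample weights w = 1 / (P(y|h) + alpha).

    label_probs -- estimated probability P(y|h) of each sample's gold label,
                   obtained from k-fold cross prediction (step 01).
    labels      -- gold label of each sample.
    alpha       -- smooth term; larger values push the weights toward uniform.
    """
    weights = [1.0 / (p + alpha) for p in label_probs]

    # Normalize so the weights of each label sum to the same value,
    # keeping the label prior unchanged.
    per_label = defaultdict(float)
    for w, y in zip(weights, labels):
        per_label[y] += w
    target = sum(per_label.values()) / len(per_label)
    return [w * target / per_label[y] for w, y in zip(weights, labels)]
```

Note that samples whose label is easily revealed by hypothesis artifacts (high $P(y \mid h)$) receive small weights, while hard-to-guess samples are upweighted.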

4 Experimental Results

Columns SNLI, MMatch and MMismatch are Human Elicited and SICK and JOCI are Human Judged (Cross-dataset Testing); Hard and Easy are the Hard-Easy Testing subsets.

| Trainset | Model | Smooth | SNLI | MMatch | MMismatch | SICK | JOCI | Hard | Easy |
|---|---|---|---|---|---|---|---|---|---|
| SNLI | Hyp | 0.000 | 0.4776 | 0.4968 | 0.5013 | 0.4923 | 0.5016 | 0.5190 | 0.4587 |
| | | 0.001 | 0.4779 | 0.4952 | 0.4924 | 0.4934 | 0.5044 | 0.5242 | 0.4568 |
| | | 0.010 | 0.5318 | 0.5092 | 0.5097 | 0.5036 | 0.4961 | 0.5225 | 0.5375 |
| | | 0.100 | 0.7749 | 0.6009 | 0.6124 | 0.6060 | 0.5304 | 0.5179 | 0.8811 |
| | | Baseline | 0.8496 | 0.6305 | 0.6399 | 0.6250 | 0.5080 | 0.4793 | 0.9755 |
| SNLI | Norm | 0.000 | 76.61% | 50.51% | 51.50% | 52.63%* | 44.15% | 74.36%* | 77.72% |
| | | 0.001 | 72.75% | 45.05% | 46.24% | 48.25% | 39.68% | 72.95% | 72.65% |
| | | 0.010 | 78.94% | 54.53% | 55.97% | 52.68%* | 46.19%* | 75.38%* | 80.71% |
| | | 0.100 | 83.57% | 57.77% | 60.37% | 53.45%* | 47.84%* | 76.02%* | 87.32% |
| | | Baseline | 86.98% | 61.95% | 64.00% | 52.07% | 45.63% | 73.81% | 93.52% |
| MultiNLI | Hyp | 0.000 | 0.4647 | 0.4427 | 0.4429 | 0.4685 | 0.4874 | 0.4957 | 0.3998 |
| | | 0.001 | 0.4433 | 0.4174 | 0.4152 | 0.4583 | 0.4933 | 0.4969 | 0.3534 |
| | | 0.010 | 0.4560 | 0.4562 | 0.4590 | 0.4723 | 0.4970 | 0.4992 | 0.4201 |
| | | 0.020 | 0.4741 | 0.4850 | 0.4957 | 0.5003 | 0.4969 | 0.5006 | 0.4703 |
| | | 0.100 | 0.5711 | 0.6482 | 0.6596 | 0.5944 | 0.5208 | 0.5023 | 0.7619 |
| | | Baseline | 0.6483 | 0.7252 | 0.7253 | 0.6079 | 0.4587 | 0.4998 | 0.8915 |
| MultiNLI | Norm | 0.000 | 52.06% | 58.92% | 60.63% | 52.99% | 48.27%* | 56.80% | 60.78% |
| | | 0.001 | 53.90% | 59.48% | 60.50% | 52.70% | 45.67%* | 58.19% | 60.61% |
| | | 0.010 | 58.13% | 62.82% | 64.35% | 54.17% | 45.78%* | 61.27% | 64.18% |
| | | 0.020 | 61.37% | 66.68% | 68.18% | 57.20%* | 48.59%* | 62.21% | 70.60% |
| | | 0.100 | 64.16% | 71.54% | 72.77% | 58.35%* | 48.81%* | 66.14% | 76.28% |
| | | Baseline | 68.49% | 76.20% | 76.38% | 56.74% | 41.18% | 66.24% | 84.92% |

Table 1: Evaluation results of Hyp and Norm. Baseline refers to the model trained and validated without weights. Hard and Easy refer to the Hard-Easy Testing subsets generated from the testing set corresponding to the Trainset column. Results of Hyp are averaged over five runs with different random initializations. We report AUC for Hyp and ACC for Norm. “*” indicates where Norm is better than the baseline.

In this section, we present the experimental results for cross-dataset testing of artifacts and for artifact-balanced learning. We show that cross-dataset testing is less affected by annotation artifacts, though some influence remains and varies across datasets. We also demonstrate that our proposed framework can mitigate the bias and improve the generalization ability of models.

4.1 Evaluation Scheme

Cross-dataset Testing

We utilize SNLI (Bowman et al., 2015), MultiNLI (Williams et al., 2018), JOCI (Zhang et al., 2017) and SICK (Marelli et al., 2014) for cross-dataset testing.

SNLI and MultiNLI are prepared by Human Elicited, in which workers are given a context and asked to produce hypotheses corresponding to the labels. SICK and JOCI are created by Human Judged, meaning that hypotheses and premises are automatically paired while labels are generated by humans (Poliak et al., 2018). To mitigate the impacts of annotation artifacts during evaluation as much as possible, we train and validate models on SNLI and MultiNLI respectively and test on both SICK and JOCI. We also report models’ performances on SNLI and MultiNLI.

For SNLI, we use the same partition as Bowman et al. (2015). For MultiNLI, we use the two original validation sets (Matched and Mismatched) as the testing sets for convenience, and refer to them as MMatch and MMismatch; we randomly select 10,000 samples from the original training set for validation and use the rest for training. For JOCI, we use the whole “B” subset for testing, whose premises are from SNLI-train while hypotheses are generated based on world knowledge (Zhang et al., 2017), and convert the scores to NLI labels following Poliak et al. (2018). For SICK, we use the whole dataset for testing.
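The JOCI score conversion can be sketched as below; the exact thresholds (1 maps to contradiction, 2-4 to neutral, 5 to entailment) follow our reading of the recasting in Poliak et al. (2018) and should be treated as an assumption of this sketch:

```python
def joci_score_to_nli(score):
    """Map a JOCI ordinal likelihood score (1-5) to an NLI label.

    1 ("impossible")  -> contradiction
    2-4 (middle)      -> neutral
    5 ("very likely") -> entailment
    """
    if score == 1:
        return "contradiction"
    if score == 5:
        return "entailment"
    if score in (2, 3, 4):
        return "neutral"
    raise ValueError(f"unexpected JOCI score: {score!r}")
```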

Hard-Easy Testing

To determine how biased the models are, we partition the testing sets of SNLI and MMatch into two subsets: examples that the hypothesis-only model classifies correctly form Easy, and the rest form Hard, as in Gururangan et al. (2018). More detailed information is presented in Appendix B.1.
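The partition itself is mechanical given the hypothesis-only model's predictions; a minimal sketch (interface names are ours):

```python
def hard_easy_split(hyp_only_preds, gold_labels):
    """Partition test examples by whether a hypothesis-only model
    classifies them correctly: correct -> Easy, incorrect -> Hard.
    Returns the two lists of example indices."""
    hard, easy = [], []
    for i, (pred, gold) in enumerate(zip(hyp_only_preds, gold_labels)):
        (easy if pred == gold else hard).append(i)
    return hard, easy
```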

4.2 Experiment Setup

We refer to models trained only with hypotheses as the hypothesis-only model (Hyp) and models that utilize both premises and hypotheses as the normal model (Norm). We implement a simple LSTM model for Hyp and use DIIN (Gong et al., 2017) as Norm. We report AUC for Hyp and ACC for Norm; for AUC, we calculate the metric for each label and report the mean. More details can be seen in Appendix B.2.
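The per-label mean AUC can be computed as a one-vs-rest, rank-based (Mann-Whitney) AUC for each label, averaged; a self-contained sketch, with names of our own choosing:

```python
def binary_auc(scores, targets):
    """Rank-based AUC for one binary label: the probability that a
    random positive is scored above a random negative (ties count 0.5)."""
    pos = [s for s, t in zip(scores, targets) if t == 1]
    neg = [s for s, t in zip(scores, targets) if t == 0]
    if not pos or not neg:
        raise ValueError("need both positive and negative examples")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_auc(probas, labels,
              classes=("entailment", "neutral", "contradiction")):
    """One-vs-rest AUC per label, then the unweighted mean."""
    aucs = []
    for k, c in enumerate(classes):
        scores = [p[k] for p in probas]
        targets = [1 if y == c else 0 for y in labels]
        aucs.append(binary_auc(scores, targets))
    return sum(aucs) / len(aucs)
```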

We estimate $P(y \mid h)$ for SNLI and MultiNLI respectively using BERT (Devlin et al., 2018) with 10-fold cross prediction. To investigate the impacts of the smooth term, we choose a series of smooth values and present the results. Considering that models may be unstable during training due to the varied scale of the weights, we sample examples for every mini-batch with probabilities proportional to the weights instead of adding the weights to the loss directly.
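This batch-level sampling can be sketched as a generic index sampler (our own illustration, detached from the actual training loop):

```python
import random

def weighted_batches(weights, batch_size, num_batches, seed=0):
    """Yield mini-batches of example indices sampled with probability
    proportional to the sample weights, instead of multiplying the
    weights into the loss. Each batch then contributes a gradient of
    stable scale even when the weights vary widely."""
    rng = random.Random(seed)
    indices = range(len(weights))
    for _ in range(num_batches):
        # Sampling with replacement, proportional to the weights.
        yield rng.choices(indices, weights=weights, k=batch_size)
```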

The evaluation results are reported in Table 1.

4.3 Can Artifacts Generalize Across Datasets?

Annotation artifacts generalize across Human Elicited datasets. From the AUC of the Hyp baseline trained on SNLI, we can see that the bias pattern of SNLI is strongly predictive on SNLI itself and on the other two Human Elicited testing sets. The behavior of the model trained on MultiNLI is similar.

Annotation artifacts of SNLI and MultiNLI generalize to SICK. Unexpectedly, the Hyp baseline reaches 0.6250 (AUC) when trained on SNLI and 0.6079 (AUC) when trained on MultiNLI and tested on SICK, indicating that the bias patterns of SNLI and MultiNLI are predictive on SICK. The results imply that the bias pattern can even generalize across datasets prepared by different methods.

Annotation artifacts of SNLI are nearly neutral in JOCI, while those of MultiNLI are misleading. We find that the AUC of the Hyp baseline trained on SNLI is very close to 0.5 on JOCI, indicating that JOCI is nearly neutral to the artifacts in SNLI. However, when training on MultiNLI, the AUC of the Hyp baseline is lower than 0.5, indicating that the artifacts are misleading on JOCI.

4.4 Debiasing Results

Effectiveness of Debiasing

Focusing on the results when the smooth term equals 0.01 for SNLI and 0.02 for MultiNLI, we observe that the AUC of Hyp on all testing sets is approximately 0.5, indicating that Hyp’s predictions are approximately equivalent to random guessing. Also, the gap between Hard and Easy for Norm decreases significantly compared with the baseline. We can thus conclude that, with a proper smooth term, our method effectively alleviates the bias pattern.

With other smooth terms, our method still has some debiasing ability. On the testing sets that are not neutral to the bias pattern, the AUC of Hyp always comes closer to 0.5 than the baseline, whatever the smooth value. The performances of Norm on Hard and Easy also come closer together than the baseline’s. Norm trained on SNLI even exceeds the baseline on Hard with most smooth terms.

From the results of Hyp, we find a trend that the larger the smooth value, the weaker the debiasing, while with a very small or zero smooth value, the AUC may fall below 0.5. As mentioned before, we attribute this to the imperfect estimation of $P(y \mid h)$, and we conclude that a proper smooth value is a prerequisite for the best debiasing effect.

Benefits of Debiasing

Debiasing may improve models’ generalization ability in two ways: (1) it mitigates the misleading effect of annotation artifacts; (2) it improves models’ semantic learning ability.

When the annotation artifacts of the training set cannot be generalized to the testing set, which should be more common in the real world, predicting by artifacts may hurt models’ performance. Focusing on the results on JOCI, where the bias pattern of MultiNLI is misleading, we find that Norm trained on MultiNLI outperforms the baseline after debiasing with all smooth values tested.

Furthermore, debiasing can reduce models’ dependence on the bias pattern during training, thus forcing models to better learn semantic information to make predictions. Norm trained on SNLI exceeds the baseline on JOCI with smooth terms 0.01 and 0.1. With larger smooth terms, Norm trained on both SNLI and MultiNLI exceeds the baseline on SICK. Given that JOCI is almost neutral to the artifacts in SNLI, and that the bias patterns of both SNLI and MultiNLI are even predictive on SICK, we attribute these gains to our method improving models’ semantic learning ability.

On the other testing sets, i.e., SNLI, MMatch and MMismatch, we notice that the performance of Norm always decreases compared with the baseline. As mentioned before, both SNLI and MultiNLI are prepared by Human Elicited, and their artifacts generalize to each other. We attribute the drop to the detrimental effect of mitigating the predictive bias pattern outweighing the beneficial effect of improved semantic learning ability.

5 Conclusion

In this paper, we take a close look at the annotation artifacts in NLI datasets. We find that the bias pattern can be predictive or misleading in cross-dataset testing. Furthermore, we propose a debiasing framework, and experiments demonstrate that it can effectively mitigate the impacts of the bias pattern and improve the cross-dataset generalization ability of models. However, how to treat the annotation artifacts remains an open problem: we cannot assert whether the bias pattern should not exist at all or whether it reflects some natural regularity of the task. We hope that our findings will encourage more exploration of reliable evaluation protocols for NLI models.


  • M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. (2016) Tensorflow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283. Cited by: §B.2.
  • S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning (2015) A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 632–642. Cited by: §1, §2, §4.1, §4.1.
  • R. B. d’Agostino (1998) Propensity score methods for bias reduction in the comparison of a treatment to a non-randomized control group. Statistics in medicine 17 (19), pp. 2265–2281. Cited by: §2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §4.2.
  • Y. Gong, H. Luo, and J. Zhang (2017) Natural language inference over interaction space. arXiv preprint arXiv:1709.04348. Cited by: §B.2, §1, §4.2.
  • S. Gururangan, S. Swayamdipta, O. Levy, R. Schwartz, S. Bowman, and N. A. Smith (2018) Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), Vol. 2, pp. 107–112. Cited by: §B.1, §1, §1, §2, §3, §4.1.
  • S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456. Cited by: §B.2.
  • A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov (2017) Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 427–431. Cited by: §B.1.
  • O. Levy and I. Dagan (2016) Annotating relation inference in context via question answering. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Vol. 2, pp. 249–255. Cited by: §2.
  • M. Marelli, L. Bentivogli, M. Baroni, R. Bernardi, S. Menini, and R. Zamparelli (2014) Semeval-2014 task 1: evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. In Proceedings of the 8th international workshop on semantic evaluation (SemEval 2014), pp. 1–8. Cited by: §4.1.
  • J. Pennington, R. Socher, and C. Manning (2014) Glove: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543. Cited by: §B.2.
  • A. Poliak, J. Naradowsky, A. Haldar, R. Rudinger, and B. Van Durme (2018) Hypothesis only baselines in natural language inference. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pp. 180–191. Cited by: §2, §4.1, §4.1.
  • P. R. Rosenbaum and D. B. Rubin (1983) The central role of the propensity score in observational studies for causal effects. Biometrika 70 (1), pp. 41–55. Cited by: §2.
  • R. Rudinger, C. May, and B. Van Durme (2017) Social bias in elicited natural language inferences. In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, pp. 74–79. Cited by: §2.
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: §B.2.
  • T. Tieleman and G. Hinton (2012) Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning 4 (2), pp. 26–31. Cited by: §B.2.
  • M. Tsuchiya (2018) Performance impact caused by hidden bias of training data for recognizing textual entailment. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018), Cited by: §1.
  • H. Wang, D. Sun, and E. P. Xing (2018) What if we simply swap the two text fragments? a straightforward yet effective way to test the robustness of methods to confounding signals in nature language inference tasks. arXiv preprint arXiv:1809.02719. Cited by: §1, §1, §2.
  • A. Williams, N. Nangia, and S. Bowman (2018) A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Vol. 1, pp. 1112–1122. Cited by: §1, §2, §4.1.
  • B. Zadrozny (2004) Learning and evaluating classifiers under sample selection bias. In Proceedings of the twenty-first international conference on Machine learning, pp. 114. Cited by: §2.
  • G. Zhang, B. Bai, J. Liang, K. Bai, S. Chang, M. Yu, C. Zhu, and T. Zhao (2019) Selection bias explorations and debias methods for natural language sentence matching datasets. arXiv preprint arXiv:1905.06221. Cited by: §2, §3.
  • S. Zhang, R. Rudinger, K. Duh, and B. Van Durme (2017) Ordinal common-sense inference. Transactions of the Association for Computational Linguistics 5, pp. 379–395. Cited by: §4.1, §4.1.

Appendix A Detailed Assumptions and Proof of Theorem 1

We make a few assumptions about an artifact-balanced distribution and how the biased datasets are generated from it, and demonstrate that we can train models fitting the artifact-balanced distribution using only the biased datasets.

We consider the domain of the artifact-balanced distribution as $\mathcal{X} \times \mathcal{Y} \times \mathcal{A} \times \mathcal{S}$, in which $\mathcal{X}$ is the input variable space, $\mathcal{Y}$ is the label space, $\mathcal{A}$ is the feature space of annotation artifacts in hypotheses, and $\mathcal{S}$ is the selection intention space. We assume that the biased distribution of the original datasets can be generated from the artifact-balanced distribution by selecting samples with $s = y$, i.e., the selection intention matches the label. We use $\tilde{P}$ to represent the probability on the artifact-balanced distribution and use $P$ for the biased distribution, i.e., $P(\cdot) = \tilde{P}(\cdot \mid s = y)$.

We also make some assumptions about the artifact-balanced distribution. The first one is that the label is independent of the artifact $a$ in the hypothesis, defined as follows,

$$\tilde{P}(y \mid a) = \tilde{P}(y).$$

The second one is that the selection intention is independent of $x$ and $y$ when the annotation artifact is given,

$$\tilde{P}(s \mid a, x, y) = \tilde{P}(s \mid a).$$

And we can prove the equivalence of training with the weight $\frac{1}{P(y \mid a)}$ and fitting the artifact-balanced distribution. We first present an equation as follows,

$$P(x, y) = \tilde{P}(x, y \mid s = y) = \frac{\tilde{P}(s = y \mid x, y)\,\tilde{P}(x, y)}{\tilde{P}(s = y)} = \frac{\tilde{P}(s = y \mid a)\,\tilde{P}(x, y)}{\tilde{P}(s = y)},$$

where the last step uses the second assumption. Without loss of generality, we can assume $\sum_{y' \in \mathcal{Y}} \tilde{P}(s = y' \mid a) = 1$ and get that,

$$P(y \mid a) = \frac{\tilde{P}(s = y \mid a)\,\tilde{P}(y \mid a)}{\sum_{y'} \tilde{P}(s = y' \mid a)\,\tilde{P}(y' \mid a)} = \tilde{P}(s = y \mid a),$$

where the second step uses the first assumption together with the equal label priors. With the above derivation, we can prove the equivalence as follows,

$$\mathbb{E}_{(x, y) \sim P}\!\left[\frac{\ell(f(x), y)}{P(y \mid a)}\right] = \sum_{x, y} \frac{\tilde{P}(s = y \mid a)\,\tilde{P}(x, y)}{\tilde{P}(s = y)} \cdot \frac{\ell(f(x), y)}{\tilde{P}(s = y \mid a)} = \frac{1}{\tilde{P}(s = y)}\,\mathbb{E}_{(x, y) \sim \tilde{P}}\big[\ell(f(x), y)\big].$$

As $\frac{1}{\tilde{P}(s = y)}$ is just a constant, training with the weighted loss is equivalent to fitting the artifact-balanced distribution. Given the hypothesis variable $h$, the probability $P(y \mid a)$ can be replaced by $P(y \mid h)$ since the predictive ability of hypotheses comes entirely from the annotation artifacts, and we can use $\frac{1}{P(y \mid h)}$ as the weights during training.

Appendix B Experiment Setting

B.1 Hard-Easy Datasets Setting

For SNLI, we use the Hard subset released by Gururangan et al. (2018). For MMatch, we partition the set ourselves using fastText (Joulin et al., 2017). The sizes of the datasets used in Hard-Easy Testing are summarized below.

| Trainset | Hard | Easy |
|---|---|---|
| SNLI | 3261 | 6563 |
| MultiNLI | 4583 | 5232 |

B.2 Experiment Setup

For DIIN, we use the same settings as Gong et al. (2017) but do not use syntactic features. The priors of the labels are normalized to be the same. For the hypothesis-only model, we implement a naïve model with one LSTM layer followed by a three-layer MLP, implemented with Keras on a TensorFlow backend (Abadi et al., 2016). We use the 300-dimensional GloVe embeddings trained on the Common Crawl 840B-token corpus (Pennington et al., 2014) and keep them fixed during training. Batch Normalization (Ioffe and Szegedy, 2015) is applied after every hidden layer in the MLP, and we use Dropout (Srivastava et al., 2014) with rate 0.5 after the last hidden layer. We use RMSProp (Tieleman and Hinton, 2012) as the optimizer with a learning rate of 1e-3, gradient clipping of 1.0, and a batch size of 256.