The Curse of Performance Instability in Analysis Datasets: Consequences, Source, and Suggestions

We find that the performance of state-of-the-art models on Natural Language Inference (NLI) and Reading Comprehension (RC) analysis/stress sets can be highly unstable. This raises three questions: (1) How will the instability affect the reliability of the conclusions drawn based on these analysis sets? (2) Where does this instability come from? (3) How should we handle this instability and what are some potential solutions? For the first question, we conduct a thorough empirical study over analysis sets and find that in addition to the unstable final performance, the instability exists all along the training curve. We also observe lower-than-expected correlations between the analysis validation set and standard validation set, questioning the effectiveness of the current model-selection routine. Next, to answer the second question, we give both theoretical explanations and empirical evidence regarding the source of the instability, demonstrating that the instability mainly comes from high inter-example correlations within analysis sets. Finally, for the third question, we discuss an initial attempt to mitigate the instability and suggest guidelines for future work such as reporting the decomposed variance for more interpretable results and fair comparison across models. Our code is publicly available at: https://github.com/owenzx/InstabilityAnalysis


1 Introduction

Neural network models have significantly pushed forward performance on natural language processing benchmarks with the development of large-scale language model pre-training Peters et al. (2018); Radford et al. (2018); Devlin et al. (2019); Radford et al. (2019); Liu et al. (2019). For example, on two semantically challenging tasks, Natural Language Inference (NLI) and Reading Comprehension (RC), the state-of-the-art results have reached or even surpassed the estimated human performance on certain benchmark datasets Wang et al. (2019); Rajpurkar et al. (2016, 2018). These astounding improvements, in turn, motivate a new trend of research to analyze what language understanding and reasoning skills are actually achieved, versus what is still missing within these current models. Following this trend, numerous analysis approaches have been proposed to examine models’ ability to capture different linguistic phenomena (e.g., named entities, syntax, lexical inference, etc.). Those studies are often conducted in three steps: (1) proposing hypotheses about a certain ability of the models; (2) building analysis datasets by automatic generation or crowd-sourcing; (3) drawing conclusions about that ability based on results on these analysis datasets.

Past analysis studies have led to many key discoveries in NLP models, such as over-stability Jia and Liang (2017), surface pattern overfitting Gururangan et al. (2018), but recently McCoy et al. (2019a) found that the results of different runs of BERT NLI models have large non-negligible variances on the HANS McCoy et al. (2019b) analysis datasets, contrasting sharply with their stable results on standard validation set across multiple seeds. This finding raises concerns regarding the reliability of individual results reported on those datasets, the conclusions made upon these results, and lack of reproducibility Makel et al. (2012). Thus, to help consolidate further developments, we conduct a deep investigation on model instability, showing how unstable the results are, and how such instability compromises the feedback loop between model analysis and model development.

We start our investigation with a thorough empirical study of several representative models on both NLI and RC. Overall, we make four worrisome observations in our experiments: (1) The final results of the same model with different random seeds have significantly high variance on several analysis sets. The largest normalized standard deviation is more than 27 times that on the standard development set; (2) This large instability on certain datasets is model-agnostic. Certain datasets have unstable results across different models; (3) The instability not only occurs at the final performance but exists all along the training trajectory, as shown in Fig. 1; (4) The results of the same model on analysis sets and on the standard development set have low correlation, making it hard to draw any constructive conclusion and questioning the effectiveness of the standard model-selection routine.

Next, to better understand this instability issue, we explore the theoretical explanations behind it. Through our theoretical analysis and empirical demonstration, we show that inter-example correlation within the dataset is the dominating factor causing this performance instability. Specifically, the variance of model accuracy on the entire analysis set can be decomposed into two terms: (1) the sum of single-data variance (the variance caused by individual prediction randomness on each example), and (2) the sum of inter-data covariance (caused by the correlation between different predictions). To understand the latter term better, consider the following case: if there are many examples correlated with each other in the evaluation set, then a change of the model prediction on one example tends to come with changes of the predictions on all the correlated examples, causing high variance in final accuracy. We estimate these two terms with multiple runs of experiments and show that inter-data covariance contributes significantly more than single-data variance to final accuracy variance, indicating its major role in the cause of instability.

Finally, in order for the continuous progress of the community to be built upon trustworthy and interpretable results, we provide initial suggestions on how to perceive the implication of this instability issue and how we should potentially handle it. For this, we encourage future research to: (1) when reporting means and variance over multiple runs, also report two decomposed variance terms (i.e., sum of single data variance and sum of inter-data covariance) for more interpretable results and fair comparison across models; (2) focus on designing models with better inductive and structural biases, and datasets with higher linguistic diversity.

Overall, our contribution is 3-fold. First, we provide a thorough empirical study of the instability issue in models’ performance on analysis datasets. Second, we demonstrate theoretically and empirically that the performance variance is attributed mostly to the inter-example correlation. Finally, we provide some suggestions on how to deal with this instability issue, including reporting of the decomposed variance for more interpretable evaluation and better comparison.

2 Related Work

NLI and RC Analysis.

Many analysis works have been conducted to study what the models are actually capturing alongside recent improvements on NLI and RC benchmark scores. In NLI, some analyses target word/phrase level lexical/semantic inference Glockner et al. (2018); Shwartz and Dagan (2018); Carmona et al. (2018), some are more syntax-related McCoy et al. (2019b); Nie et al. (2019); Geiger et al. (2019), and some involve logic-related studies Minervini and Riedel (2018); Wang et al. (2019). Naik et al. (2018) proposed a suite of analysis sets covering different linguistic phenomena. In RC, adversarial-style analysis is used to test the robustness of the models Jia and Liang (2017). Most of the work follows the style of Carmona et al. (2018) to diagnose/analyze models’ behavior on pre-designed analysis sets. In this paper, we analyze NLI and RC models from a broader perspective by inspecting models’ performance across different analysis sets, and their inter-dataset and intra-dataset relationships.

Dataset-Related Analysis.

As deep learning models heavily rely on high-quality training sets, another line of work has aimed at studying the meta-issues of the data and the dataset creation itself. The most well-known one of this kind is the analysis of undesirable bias. In VQA datasets, unimodal biases were found, compromising their authority on multi-modality evaluation Jabri et al. (2016); Goyal et al. (2017). In machine comprehension, Kaushik and Lipton (2018) found that passage-only models can achieve decent accuracy. In NLI, hypothesis bias was also found in SNLI and MultiNLI Tsuchiya (2018); Gururangan et al. (2018). All these findings raised concerns regarding spurious shortcuts that emerged in dataset collection and their unintended and harmful effects on trained models.

To mitigate these problems, several recent works have proposed new guidelines for better collection and use of datasets. Specifically, Liu et al. (2019) introduced a systematic and task-agnostic method to analyze datasets. Rozen et al. (2019) further explain how to improve challenging datasets and why diversity matters. Geva et al. (2019) suggest that annotator bias should be monitored throughout the collection process and that part of the test data be created by exclusive annotators. Our work is complementary to those analyses.

Robustifying NLI and RC Models.

Recently, a number of works have been proposed to directly improve the performance on the analysis datasets both for NLI through model ensembling Clark et al. (2019); He et al. (2019), novel training mechanisms Pang et al. (2019); Yaghoobzadeh et al. (2019), enhancing word representations Moosavi et al. (2019), and for RC through using different training objectives  Yeh and Chen (2019); Lewis and Fan (2019). While improvements have been made on certain analysis datasets, the stability of the results is not examined. As explained in this paper, we highly recommend those result variances be scrutinized in future work for fidelity considerations.

Instability in Performance.

Performance instability has already been recognized as an important issue in deep reinforcement learning Irpan (2018) and active learning Bloodgood and Grothendieck (2013). However, supervised learning is presumably stable, especially with fixed datasets and labels. This assumption has recently been challenged by some analyses. McCoy et al. (2019a) show high variances in NLI models’ performance on the analysis dataset. Phang et al. (2018) found high variances in fine-tuning pre-trained models in several NLP tasks on the GLUE Benchmark. Reimers and Gurevych (2017, 2018) state that conclusions based on single-run performance may not be reliable for machine learning approaches. Weber et al. (2018) found that the model’s ability to generalize beyond the training distribution depends greatly on the chosen random seed. Dodge et al. (2020) showed that weight initialization and training data order both contribute to the randomness in BERT performance. In our work, we present a comprehensive explanation and analysis of the instability of neural models on analysis datasets and give general guidance for future work.

Figure 2: The results of BERT, RoBERTa, and XLNet on all datasets with 10 different random seeds. Large variance can be seen at certain analysis datasets (e.g. STR-NU, HANS, etc.) while results on standard validation sets are always stable.

3 The Curse of Instability

3.1 Tasks and Datasets

In this work, we target our experiments on NLI and RC for two reasons: 1) their straightforwardness for both automatic evaluation and human understanding, and 2) their wide acceptance of being benchmark tasks for evaluating natural language understanding.

For NLI, we use SNLI Bowman et al. (2015) and MNLI Williams et al. (2018) as the main standard datasets and use HANS McCoy et al. (2019b), SNLI-hard Gururangan et al. (2018), BREAK-NLI Glockner et al. (2018), and Stress Test Naik et al. (2018) as our auxiliary analysis sets. Note that the Stress Test contains 6 subsets (denoted as ‘STR-X’) targeting different linguistic categories. For RC, we use SQuAD1.1 Rajpurkar et al. (2016) as the main standard dataset and use AdvSQuAD Jia and Liang (2017) as the analysis set. Detailed descriptions of the models and datasets are in the Appendix.

3.2 Models and Training

Since BERT Devlin et al. (2019) achieved state-of-the-art results on several NLP tasks, the pretraining-then-finetuning framework has been widely used. To keep our analysis aligned with recent progress, we focus our experiments on this framework. Specifically, in our study, we use the two most typical choices: BERT Devlin et al. (2019) and XLNet Yang et al. (2019). (Footnote 2: For all the transformer models, we use the implementation in [...]. BERT-B and BERT-L stand for BERT-base and BERT-large, respectively. The same naming rule applies to other transformer models.) Moreover, for NLI, we additionally use RoBERTa Liu et al. (2019) and ESIM Chen et al. (2017) in our experiments. RoBERTa is almost the same as BERT except that it has been trained on 10 times more data during the pre-training phase to be more robust. ESIM is the most representative pre-BERT model for sequence matching problems, and we use an ELMo-enhanced version Peters et al. (2018). (Footnote 3: For ESIM, we use the implementation in AllenNLP Gardner et al. (2018).)

Training Details.

For all pre-trained transformer models, namely, BERT, RoBERTa, and XLNet, we use the same set of hyper-parameters for analysis consideration. For NLI, we use the suggested hyper-parameters in Devlin et al. (2019). The batch size is set to 32 and the peak learning rate is set to 2e-5. We save checkpoints every 500 iterations, resulting in 117 intermediate checkpoints. In our preliminary experiments, we find that tuning these hyper-parameters does not significantly influence the results. The training set for NLI is the union of the SNLI Bowman et al. (2015) and MNLI Williams et al. (2018) training sets and is fixed across all the experiments. This gives us a good estimate of state-of-the-art performance on NLI that is fairly comparable to other analysis studies. For RC, we use a batch size of 12 and set the peak learning rate to 3e-5. RC models are trained on SQuAD1.1 Rajpurkar et al. (2016) for 2 epochs.

3.3 What are the Concerns?

Instability in Final Performance.

Models’ final results often serve as a vital (if not the only) measurement for comparative study. Thus, we start with the question: “How unstable are the final results?” To measure the instability, we train every model 10 times with different random seeds. Then, we evaluate the performance of all the final checkpoints on each NLI dataset and compute the standard deviations. As shown in Fig. 2, the results of different runs for BERT, RoBERTa, and XLNet are highly stable on MNLI-m, MNLI-mm, and SNLI, indicating that model performance on standard validation datasets is fairly stable regardless of domain consistency. (Footnote 4: Here SNLI and MNLI-m share the same domain as the training set while MNLI-mm is from different domains.) This stability also holds on some analysis sets, especially on SNLI-hard, which is a strict subset of the SNLI validation set. On the contrary, there are noticeably high variances in the results on some analysis sets. The most significant ones are on STR-NU and HANS, where points are sparsely scattered, with a 10-point gap between the highest and the lowest number for STR-NU and a 4-point gap for HANS.

Model-Agnostic Instability.

Next, we check if the instability issue is model-agnostic. For a fair comparison, as the different sizes of the datasets will influence the magnitude of the instability, we normalize the standard deviation on different datasets by multiplying it by the square root of the size of the dataset, and focus on the relative scale compared to the results on the MNLI-m development set, i.e., $\frac{\mathrm{STD}(D)\,\sqrt{|D|}}{\mathrm{STD}(D_{\text{MNLI-m}})\,\sqrt{|D_{\text{MNLI-m}}|}}$ for a dataset $D$. The results for all the models are shown in Table 1 (the original means and standard deviations are in the Appendix). From Table 1, we can see that the instability phenomenon is consistent across all the models. Regardless of the model choice, some of the analysis datasets (e.g., HANS, STR-O, STR-N) are significantly more unstable (with a normalized standard deviation 27 times larger in the extreme case) than the standard evaluation datasets. Similarly, for RC, the normalized deviation of model F1 results on SQuAD almost doubled when evaluated on AddSent, as shown in Table 2 (the original means and standard deviations are in the Appendix).
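The normalization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the per-seed accuracies, and the dataset sizes are hypothetical and are not taken from the paper, and the population standard deviation is used (the choice between population and sample deviation cancels in the ratio when both datasets use the same number of seeds).

```python
import math
import statistics


def normalized_deviation(accuracies, dataset_size):
    """Standard deviation over seeds, scaled by sqrt(dataset size)."""
    return statistics.pstdev(accuracies) * math.sqrt(dataset_size)


def relative_normalized_deviation(accs, size, ref_accs, ref_size):
    """Normalized deviation of a dataset relative to a reference dataset."""
    return normalized_deviation(accs, size) / normalized_deviation(ref_accs, ref_size)


# Hypothetical per-seed accuracies (%), made up for illustration only.
ref_runs = [84.0, 84.2, 84.4, 84.2]       # reference set, assumed 10,000 examples
analysis_runs = [60.0, 62.0, 64.0, 62.0]  # analysis set, assumed 2,500 examples

ratio = relative_normalized_deviation(analysis_runs, 2500, ref_runs, 10000)
print(f"relative normalized deviation: {ratio:.2f}")
```

In this toy case the raw deviation on the analysis set is ten times larger, but the set is four times smaller, so the size correction halves the ratio.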

Model      Standard Datasets       Analysis Sets
ESIM       1.00   0.57   0.73      3.84   0.82   0.73   0.77   0.73   3.57   4.63   2.58   2.79
ESIM+ELMo  1.00   2.00   1.50      11.5   4.55   2.48   3.10   2.20   7.50   15.5   6.38   8.36
BERT-B     1.00   0.83   0.48      1.43   10.95  0.95   1.39   1.04   2.70   3.70   1.46   13.65
RoBERTa-B  1.00   1.46   0.64      2.82   15.42  1.47   1.27   2.17   5.45   8.45   5.55   25.75
XLNet-B    1.00   0.48   0.37      2.03   6.60   0.75   0.59   0.92   1.96   7.19   2.07   13.33
BERT-L     1.00   1.13   0.56      2.86   18.47  1.37   1.31   2.63   9.19   10.13  2.39   21.88
RoBERTa-L  1.00   0.88   0.69      1.03   10.27  1.01   1.12   1.20   12.13  10.13  4.51   27.38
XLNet-L    1.00   0.90   0.69      1.06   10.67  0.85   0.89   1.45   16.21  11.84  4.26   15.93
Table 1: Relatively normalized deviations of the results with respect to that of MNLI-m for all models. The highest deviations are in bold and the second highest deviations are underlined for each individual model.

Model      Standard Dataset   Analysis Sets
           SQuAD              AddSent   AddOneSent
BERT-B     1.00               2.61      1.58
XLNet-B    1.00               1.78      1.00
Table 2: Relatively normalized deviations of the results with respect to that of the SQuAD dev set for both BERT-B and XLNet-B.

Fluctuation in Training Trajectory.

Intuitively, the inconsistency and instability in the final performance of different runs can be caused by the randomness in initialization and stochasticity in training dynamics. To see how much these factors can contribute to the inconsistency in the final performance, we keep track of the results on different evaluation sets along the training process and compare their training trajectories. We choose HANS and STR-NU as our example unstable analysis datasets because their variances in final performance are the largest, and we choose SNLI and MNLI-m for standard validation set comparison. As shown in Fig. 1, the training curve on MNLI and SNLI (the top two lines) is highly stable, while there are significant fluctuations in the HANS and STR-NU trajectories (bottom two lines). Besides the mean and standard deviation over multiple runs, we also show the accuracy of one run as the bottom dashed line in Fig. 1. We find that two adjacent checkpoints can have a dramatically large performance gap on STR-NU. Such fluctuation in training is very likely to be one of the reasons for the instability in the final performance and might give rise to untrustworthy conclusions drawn from the final results.

Figure 3: Spearman’s correlations for different datasets showing the low correlation between standard datasets (i.e., MNLI-m, MNLI-mm, and SNLI) and all the other analysis datasets.

Low Correlation between Datasets.

The typical routine for neural network model selection requires practitioners to choose the model or checkpoint based on its performance on the validation set. This routine was followed in all previous NLI analysis studies, where models were chosen by their performance on the standard validation set and tested on analysis sets. An important assumption behind this routine is that the performance on the validation set should be correlated with the models’ general ability. However, as shown in Fig. 1, the striking difference between the wildly fluctuating training curves for analysis sets and the smooth curves for the standard validation set questions the validity of this assumption.

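Therefore, to check the effectiveness of model selection under these instabilities, we checked the correlation of the performance on different datasets during training. For dataset $D^i$, we use $a^i_{t,s}$ to denote the accuracy of the checkpoint at the $t$-th time step trained with seed $s \in S$, where $S$ is the set of all seeds. We calculate the correlation $\mathrm{Corr}_{i,j}$ between datasets $D^i$ and $D^j$ by:

$$\mathrm{Corr}_{i,j} = \frac{1}{|S|} \sum_{s \in S} \mathrm{Spearman}\left[ \left(a^i_{t,s}\right)_{t=1}^{T}, \left(a^j_{t,s}\right)_{t=1}^{T} \right]$$

where $T$ is the number of checkpoints.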

The correlations between different NLI datasets are shown in Fig. 3. We can observe high correlation among standard validation datasets (e.g. MNLI-m, MNLI-mm, SNLI) but low correlations between other dataset pairs, especially when pairing STR-O or STR-NU with MNLI or SNLI. This indicates that: 1) the standard validation set is not representative enough for certain analysis sets; 2) doing model selection solely based on the standard validation set cannot reduce the instability on low-correlated analysis sets.
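The dataset-level correlation described above can be illustrated with a small pure-Python sketch. It assumes that the correlation is Spearman's rank correlation between the two checkpoint-accuracy sequences of each seed, averaged over seeds; the helper names and the accuracy values are hypothetical and only serve as an example.

```python
import math


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = avg_rank
        pos = end + 1
    return result


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def dataset_correlation(acc_i, acc_j):
    """acc_i[s] lists the checkpoint accuracies of seed s on dataset i."""
    per_seed = [spearman(acc_i[s], acc_j[s]) for s in acc_i]
    return sum(per_seed) / len(per_seed)


# Hypothetical accuracies for 2 seeds and 5 checkpoints each.
standard_a = {0: [80, 82, 84, 85, 86], 1: [79, 83, 84, 85, 87]}
standard_b = {0: [78, 80, 83, 84, 85], 1: [77, 81, 82, 84, 86]}
analysis = {0: [55, 62, 51, 60, 53], 1: [58, 50, 61, 52, 57]}

corr_standard = dataset_correlation(standard_a, standard_b)
corr_analysis = dataset_correlation(standard_a, analysis)
print(corr_standard, corr_analysis)
```

The two smoothly improving curves are perfectly rank-correlated, while the fluctuating curve is not, which mirrors the qualitative pattern reported for standard versus analysis sets.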

4 Tracking Instability

Before answering the question how to handle these instabilities, we first seek the source of the instability to get a better understanding of the issue. We start with the intuition that high variance could be the result of high inter-example correlation within the dataset, and then provide hints from experimental observations. Next, we show theoretical evidence to formalize our claim. Finally, we conclude that the major source of variance is the inter-example correlations based on empirical results.

4.1 Inter-Example Correlations

Presumably, the wild fluctuation in the training trajectory on different datasets might come from two potential sources. Firstly, the individual prediction of each example may be highly unstable so that the prediction is constantly changing. Secondly, there might be strong inter-example correlations in the datasets such that a large proportion of predictions are more likely to change simultaneously, thus causing large instability. Here we show that the second reason, i.e., the strong inter-example prediction correlation in the datasets, is what contributes most to the overall instability.

Figure 4: The two heatmaps of inter-example correlations matrices for both MNLI and HANS. Each point in the heatmap represents the Spearman’s correlation between the predictions of an example-pair.

We examine the correlation between different example prediction pairs during the training process. In Fig. 4, we calculated the inter-example Spearman’s correlation on MNLI and HANS. Fig. 4 shows a clear difference between the inter-example correlation in stable (MNLI) datasets versus unstable (HANS) datasets. For stable datasets (MNLI), the correlations between the predictions of examples are uniformly low, while for unstable datasets (HANS), there exist clear groups of examples that have very strong inter-correlation between their predictions. This observation suggests that those groups could be a major source of instability if they contain samples with frequently changing predictions.

Statistics       Standard Dataset       Analysis Dataset
sqrt(Total Var)  0.24   0.20   0.11     0.38   1.51   0.40   0.34   0.28   0.65   0.90   0.89   3.76
sqrt(Idp Var)    0.18   0.18   0.13     0.12   0.10   0.30   0.17   0.22   0.17   0.19   0.56   0.33
sqrt(|Cov|)      0.16   0.09   0.06     0.36   1.51   0.27   0.28   0.15   0.63   0.88   0.69   3.74
Table 3: The square roots of total variance (Total Var), independent variance (Idp Var), and the absolute covariance (|Cov|) of the BERT model on different NLI datasets. The square root is applied to map variances and covariances to a normal range. Analysis datasets have much higher covariance than standard datasets.

Statistics       Standard Dataset   Analysis Dataset
                 SQuAD              AddSent   AddOneSent
sqrt(Total Var)  0.13               0.57      0.48
sqrt(Idp Var)    0.15               0.33      0.44
sqrt(|Cov|)      0.09               0.43      0.13
Table 4: The square roots of total variance (Total Var), independent variance (Idp Var), and absolute covariance (|Cov|) of the BERT model on different RC datasets.

4.2 Variance Decomposition

Next, we provide theoretical support to show how the high inter-example correlation contributes to the large variance in final accuracy. Later, we will also demonstrate that it is the major source of the large variance. Suppose dataset $D$ contains examples $\{(x_i, y_i)\}_{i=1}^{N}$, where $N$ is the number of data points in the dataset, and $x_i$ and $y_i$ are the inputs and labels, respectively. We use a random variable $C_i$ to denote whether model $M$ predicts the $i$-th example correctly: $C_i = \mathbb{1}[M(x_i) = y_i]$. We ignore the model symbol $M$ in our later notations for simplicity. The accuracy $\mathrm{Acc}$ of model $M$ is another random variable, which equals the average over $\{C_i\}$, w.r.t. different weights of the model (i.e., caused by different random seeds in our experiments):

$$\mathrm{Acc} = \frac{1}{N} \sum_{i=1}^{N} C_i$$

We then decompose the variance of the accuracy $\mathrm{Var}(\mathrm{Acc})$ into the sum of data variances $\mathrm{Var}(C_i)$, and the sum of inter-data covariances $\mathrm{Cov}(C_i, C_j)$:

$$\mathrm{Var}(\mathrm{Acc}) = \frac{1}{N^2} \sum_{i=1}^{N} \mathrm{Var}(C_i) + \frac{2}{N^2} \sum_{i<j} \mathrm{Cov}(C_i, C_j)$$

Here, the first term means the instability caused by the randomness in individual example prediction and the second term means the instability caused by the covariance of the prediction between different examples. The latter covariance term is highly related to the inter-example correlation.
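The decomposition can be estimated from a matrix of per-example correctness indicators collected over several runs. The following is a minimal sketch, assuming population (divide-by-number-of-runs) estimates so that the two terms add up exactly to the variance of the accuracy; the function name and the toy correctness matrix are hypothetical.

```python
def decompose_variance(correct):
    """Split the variance of accuracy into independent and covariance terms.

    correct[s][i] is 1 if the run with seed s predicts example i correctly,
    else 0. Returns (total_var, idp_var, cov_term).
    """
    num_runs, n = len(correct), len(correct[0])
    means = [sum(run[i] for run in correct) / num_runs for i in range(n)]

    def cov(i, j):
        # Population covariance of the correctness of examples i and j.
        return sum((run[i] - means[i]) * (run[j] - means[j])
                   for run in correct) / num_runs

    # First term: sum of per-example variances, scaled by 1 / N^2.
    idp_var = sum(cov(i, i) for i in range(n)) / n ** 2
    # Second term: sum of covariances over pairs i < j, scaled by 2 / N^2.
    cov_term = 2 * sum(cov(i, j)
                       for i in range(n) for j in range(i + 1, n)) / n ** 2

    # Variance of the accuracy computed directly, for comparison.
    accs = [sum(run) / n for run in correct]
    mean_acc = sum(accs) / num_runs
    total_var = sum((a - mean_acc) ** 2 for a in accs) / num_runs
    return total_var, idp_var, cov_term


# Toy data: 4 runs, 4 examples. Examples 0 and 1 always flip together,
# while the remaining pairs are uncorrelated.
correct = [
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]
total_var, idp_var, cov_term = decompose_variance(correct)
print(total_var, idp_var, cov_term)
```

In this toy matrix, the single perfectly correlated pair adds half as much variance again on top of the independent term, even though every example is individually just as unstable as the others.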

Finally, to demonstrate that the inter-example correlation is the major source of high variance, we calculate the total variance, the independent variance (the first term of the decomposition in Sec. 4.2), and the covariance (the second term) on every dataset. The results are shown in Table 3. While the independent variance is of similar magnitude on standard datasets and analysis datasets, we find a large gap between the covariances on the two kinds of datasets. Since the total variance follows the covariance rather than the independent variance, this indicates that the inter-example correlation in the datasets is the major reason for the higher variance on the analysis datasets.
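As a rough sanity check of the decomposition, the values reported in Table 3 can be recombined, reading its three rows in the order given by the caption (square roots of Total Var, Idp Var, and the absolute covariance). The check assumes a positive covariance in the two columns used, which the table itself does not state because it reports absolute values.

```latex
\begin{align*}
\text{first column:}\quad & \sqrt{0.18^2 + 0.16^2} = \sqrt{0.0324 + 0.0256} = \sqrt{0.0580} \approx 0.24 \\
\text{last column:}\quad  & \sqrt{0.33^2 + 3.74^2} = \sqrt{0.1089 + 13.9876} = \sqrt{14.0965} \approx 3.75
\end{align*}
```

Both results agree with the reported total values (0.24 and 3.76) up to the rounding of the table entries. In the last column, the covariance term accounts for almost all of the total variance.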

Premise: Though the author encouraged the lawyer, the tourist waited.
Hypothesis: The author encouraged the lawyer.
Label: entailment

Premise: The lawyer thought that the senators supported the manager.
Hypothesis: The senators supported the manager.
Label: non-entailment
Table 5: A highly-correlated example pair in the HANS dataset with the BERT model. This example pair has the largest covariance (0.278) among all the pairs.

Original Context: In February 2010, in response to controversies regarding claims in the Fourth Assessment Report, five climate scientists–all contributing or lead IPCC report authors–wrote in the journal Nature calling for changes to the IPCC. They suggested a range of new organizational options, from tightening the selection of lead authors and contributors to dumping it in favor of a small permanent body or even turning the whole climate science assessment process into a moderated “living” Wikipedia-IPCC. Other recommendations included that the panel employs full-time staff and remove government oversight from its processes to avoid political interference.
Question: How was it suggested that the IPCC avoid political problems?
Answer: remove government oversight from its processes
Distractor Sentence 1: It was suggested that the PANEL avoid nonpolitical problems.
Distractor Sentence 2: It was suggested that the panel could avoid nonpolitical problems by learning.
Table 6: A highly-correlated example pair in the SQuAD-AddSent dataset with the BERT model. This example pair has the largest covariance (0.278) among all the pairs.

Accuracy Mean
MNLI-m        85.1   95.3   61.6   80.9   81.9   77.3   55.5   59.9   62.9   41.1
Re-Split Dev  -      96.2   64.3   81.0   81.7   77.4   56.5   66.0   67.2   48.2
Accuracy Standard Deviation
MNLI-m        0.22   0.37   1.57   0.33   0.36   0.35   0.65   0.88   1.60   3.49
Re-Split Dev  -      0.32   1.51   0.52   0.34   0.47   0.83   2.70   1.83   2.64
Table 7: The comparison of means and standard deviations of the accuracies when model selection is conducted based on different development sets. ‘MNLI-m’ chooses the best checkpoint based on the MNLI-m validation set. ‘Re-Split Dev’ chooses the best checkpoint based on the corresponding re-split analysis-dev set.

4.3 Highly-Correlated Cases

In this section, we take a look at the examples whose predictions have high inter-correlations. As shown in Table 5, example pairs in NLI datasets with high covariance usually target the same linguistic phenomenon and share similar lexicon usage. These similarities in both syntax and lexicon make the predictions on these two examples highly correlated. The situation is similar for RC datasets. As adversarial RC datasets such as AddSent are created by appending a distractor sentence at the end of the original passage, different examples can look very similar. In Table 6, we see that two examples are created by appending two similar distractor sentences to the same context, making the predictions on these two examples highly correlated.

In conclusion, since analysis datasets are usually created using pre-specified linguistic patterns/properties and investigation phenomena in mind, the distributions of analysis datasets are less diverse than the distributions of standard datasets. The difficulty of the dataset and the lack of diversity can lead to highly-correlated predictions and high instability in models’ final performances.

5 Implications, Suggestions, and Discussion

So far, we have demonstrated how severe this instability issue is and how the instability can be traced back to the high correlation between predictions of certain example clusters. Now based on all the previous analysis results and conclusions, we discuss some potential ways of how to deal with this instability issue.

We first want to point out that this instability issue is not a simple problem that can be solved by trivial modifications of the dataset, model, or training algorithm. Below, we present an initial attempt based on dataset re-splitting, which illustrates the difficulty of solving this issue.

Limitation of Model Selection.

In this experiment, we see if an oracle model-selection process can help reduce instability. Unlike the benchmark datasets, such as SNLI, MNLI, and SQuAD, analysis sets are often proposed as a single set without dev/test splits. In Sec. 3.3, we observe that models’ performance on analysis sets has little correlation with model performance on standard validation sets, making the standard model-selection routine useless for reducing performance instability on analysis sets. Therefore, we do oracle model selection by dividing the original analysis set into an 80% analysis-dev dataset and a 20% analysis-test dataset.
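The re-splitting procedure can be sketched as follows. This is a hypothetical illustration rather than the authors' implementation: it assumes a random 80/20 split (the text does not state how the split was drawn), and the function names and toy correctness lists are made up for the example.

```python
import random


def split_analysis_set(num_examples, dev_fraction=0.8, seed=0):
    """Randomly split example indices into analysis-dev and analysis-test."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)
    cut = int(num_examples * dev_fraction)
    return indices[:cut], indices[cut:]


def accuracy(correct, indices):
    """Fraction of the given examples that are predicted correctly."""
    return sum(correct[i] for i in indices) / len(indices)


def select_and_evaluate(checkpoint_correct, dev_idx, test_idx):
    """Pick the checkpoint with the best analysis-dev accuracy.

    checkpoint_correct[t][i] is 1 if checkpoint t predicts example i
    correctly. Returns the chosen checkpoint index and its analysis-test
    accuracy.
    """
    best = max(range(len(checkpoint_correct)),
               key=lambda t: accuracy(checkpoint_correct[t], dev_idx))
    return best, accuracy(checkpoint_correct[best], test_idx)


# Toy data: 3 checkpoints evaluated on 10 analysis examples.
checkpoints = [
    [0] * 10,    # checkpoint 0 gets everything wrong
    [1] * 10,    # checkpoint 1 gets everything right
    [1, 0] * 5,  # checkpoint 2 gets every other example right
]
dev_idx, test_idx = split_analysis_set(10)
best, test_acc = select_and_evaluate(checkpoints, dev_idx, test_idx)
print(best, test_acc)
```

Repeating this selection for every seed and then taking the mean and standard deviation of the test accuracies would give numbers comparable in kind to the ‘Re-Split Dev’ rows of Table 7.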

In Table 7, we compare the results of BERT-B on the new analysis-test set with model selection based on the results on either MNLI or the corresponding analysis-dev set. Model selection on analysis-dev helps increase the mean performance on several datasets, especially on HANS, STR-O, and STR-NU, indicating the expected high correlation inside the analysis set. (Footnote 5: Although the new selection helps increase the performance mean, we suggest not to treat the results on analysis sets as benchmark scores but to only use analysis datasets as toolkits to probe model/architecture changes, since analysis datasets are easy to overfit.) However, the variances of the final results are not always reduced across datasets. Hence, besides the performance instability caused by noisy model selection, different random seeds indeed lead to models with different performance on analysis datasets. This observation might indicate that performance instability is relatively independent of the mean performance, and hints that current models may have intrinsic randomness brought by different random seeds, which is unlikely to be removed through simple dataset/model fixes.

5.1 Implications of Result Instability

If the intrinsic randomness in the model prevents a quick fix, what does this instability issue imply? At first glance, one may view the instability as a problem caused by careless dataset design or a deficiency in model architecture/training algorithms. While both parts are indeed imperfect, here we suggest it is more beneficial to view this instability as an inevitable consequence of the current datasets and models. On the data side, as these analysis datasets usually leverage specific rules or linguistic patterns to generate examples targeting specific linguistic phenomena and properties, they contain highly similar examples (see the examples in Sec. 4.3). Hence, the model’s predictions on these examples will inevitably be highly correlated. On the model side, as current models are not good enough to stably capture these hard linguistic/logical properties through learning, they will exhibit instability over some examples, which is amplified by the high correlation between examples’ predictions. These datasets can still serve as good evaluation tools as long as we are aware of the instability issue and report results with multiple runs. To better handle the instability, we also propose some long-term and short-term suggestions below, based on variance reporting and analysis dataset diversification.

5.2 Short/Long Term Suggestions

Better Analysis Reporting (Short Term).

Even if we cannot get a quick fix to remove the instability in the results, it is still important to keep making progress using currently available resources and, more importantly, to accurately evaluate this progress. Therefore, in the short run, we encourage researchers to report the decomposed variance (Idp Var and Cov) for a more accurate understanding of the models and datasets, as in Sec 4.2, Table 3, and Table 4. The first number (independent variance, i.e., Idp Var) can be viewed as a metric of how stably the model makes one single prediction, and this number can be compared across different models. Models with a lower Idp Var can be interpreted as being more stable on a single prediction. By comparing models on both the total variance and the Idp Var, we can better understand where the instability of the models comes from. Work that aims at a more stable model should reduce the total variance with more focus on Idp Var. If the goal is instead to learn the targeted property of the dataset better, then more attention should be paid to the second term (Cov) when analyzing the results.
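As a minimal sketch of such reporting, the snippet below splits the variance of accuracy across seeds into an independent-variance term and a covariance term. It assumes that the decomposition in Sec 4.2 is the usual identity for the variance of a mean, Var(Acc) = (1/N^2) Σ_i Var(C_i) + (2/N^2) Σ_{i<j} Cov(C_i, C_j), where C_i is the 0/1 correctness of example i. The function name `decompose_variance`, the input layout, and the use of population (1/S) variances are our assumptions, not the released implementation.

```python
from statistics import pvariance


def decompose_variance(correct):
    """Return (total_var, idp_var, cov_term) of accuracy across seeds.

    correct[s][i] is 1 if the run with seed s predicts example i correctly,
    else 0 (hypothetical layout: one row per seed, one column per example).
    """
    n_examples = len(correct[0])
    # Accuracy of each run, and its (population) variance across seeds.
    accs = [sum(run) / n_examples for run in correct]
    total_var = pvariance(accs)
    # Idp Var: sum of per-example variances across seeds, scaled by 1/N^2.
    idp_var = sum(
        pvariance([run[i] for run in correct]) for i in range(n_examples)
    ) / n_examples ** 2
    # The covariance term is whatever remains of the total variance.
    cov_term = total_var - idp_var
    return total_var, idp_var, cov_term


if __name__ == "__main__":
    # Toy input: two seeds, four examples that all flip together.
    print(decompose_variance([[1, 1, 1, 1], [0, 0, 0, 0]]))
```

Computing the covariance term as the remainder avoids an explicit loop over all example pairs, which matters for sets as large as HANS (30000 examples).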

Model and Dataset Suggestions (Long Term).

In the long run, we should focus on improving models (including better inductive biases and large-scale pre-training with tasks concerning structure/compositionality) so that they can reach high accuracy stably. Dataset-wise, as different analysis datasets correlate poorly with each other, we suggest building datasets from a diverse set of patterns. Such datasets test a model's systematic command of a linguistic property under different contexts, rather than its ability to solve one single pattern or property. A more diverse dataset may also lead to lower covariance between predictions, which is shown to be the major source of the instability in Section 4.
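To illustrate why lower covariance matters, consider a simplified setting (our assumption, not a result from the experiments) in which every example's correctness has the same variance σ² across seeds and every pair of examples has the same correlation ρ. The variance of accuracy is then σ²/N + ((N−1)/N)·ρ·σ², so the covariance part does not shrink as the dataset grows. The function name and the numbers in the usage example are hypothetical.

```python
def accuracy_variance(n_examples, per_example_var, rho):
    """Return (idp_var, cov_term) for equally correlated predictions.

    Assumes each example's correctness has variance `per_example_var`
    across seeds and each pair of examples has correlation `rho`.
    """
    # Independent part: shrinks as 1/N.
    idp_var = per_example_var / n_examples
    # Covariance part: approaches rho * per_example_var as N grows.
    cov_term = (n_examples - 1) / n_examples * rho * per_example_var
    return idp_var, cov_term


if __name__ == "__main__":
    # Illustrative values only: a template-heavy set (higher rho)
    # versus a more diverse set (lower rho) of the same size.
    for rho in (0.05, 0.005):
        idp_var, cov_term = accuracy_variance(1000, 0.1, rho)
        print(f"rho={rho}: idp_var={idp_var:.6f}, cov_term={cov_term:.6f}")
```

Under these assumptions, adding more examples from the same pattern barely reduces the total variance, whereas lowering ρ through more diverse patterns reduces the dominant term directly.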

6 Conclusions

Auxiliary analysis datasets are meant to be important resources for debugging and understanding models. However, the large instability of current models on these analysis sets undermines such benefits and poses non-negligible obstacles for future research. In this paper, we examine the issue of instability in detail and provide theoretical and empirical evidence that high inter-example correlation is what causes it. Finally, we give suggestions on future research directions and on better analysis variance reporting. We hope this paper will guide researchers on how to handle instability in practice and inspire future work on reducing instability in experiments.


Acknowledgments

This work was supported by ONR Grant N00014-18-1-2871, DARPA YFA17-D17AP00022, and NSF-CAREER Award 1846185. The views contained in this article are those of the authors and not of the funding agency.


Appendix A Details of NLI Models

For models, we mainly focus on the current state-of-the-art models with a pre-trained transformer structure. In addition, we also selected several traditional models to see how different structures and the use of pre-trained representations influence the result.

A.1 Transformer Models

BERT Devlin et al. (2019).

BERT is a Transformer model pre-trained with masked language supervision on a large unlabeled corpus to obtain deep bi-directional representations Vaswani et al. (2017). To conduct the task of NLI, the premise and the hypothesis are concatenated as the input, and a simple classifier is added on top of these pre-trained representations to predict the label. The whole model is fine-tuned on NLI datasets before evaluation.

RoBERTa Liu et al. (2019).

RoBERTa uses the same structure as BERT, but carefully tunes the hyper-parameters for pre-training and is trained on 10 times more data during pre-training. The fine-tuning architecture and process are the same as BERT.

XLNet Yang et al. (2019).

XLNet also adopts the Transformer structure, but its pre-training objective is generalized auto-regressive language modeling. It can also take in inputs beyond a fixed-length context by using the Transformer-XL Dai et al. (2019) architecture. The fine-tuning architecture and process are the same as BERT.

A.2 Traditional Models

ESIM Chen et al. (2017).

ESIM first uses a BiLSTM to encode both the premise and the hypothesis sentence and performs cross-attention before making the prediction with a classifier. It is a representative model from before the use of pre-trained Transformer structures.

Name Standard/Analysis #Examples #Classes
MNLI-m Standard 9815 3
MNLI-mm Standard 9832 3
SNLI Standard 9842 3
BREAK-NLI Analysis 8193 3
HANS Analysis 30000 2
SNLI-hard Analysis 3261 3
STR-L Analysis 9815 3
STR-S Analysis 8243 3
STR-NE Analysis 9815 3
STR-O Analysis 9815 3
STR-A Analysis 1561 3
STR-NU Analysis 7596 3
Table 8: Dataset statistics and categories for all the NLI datasets.
Model Standard Datasets (MNLI-m MNLI-mm SNLI) Analysis Sets (BREAK-NLI HANS SNLI-hard STR-L STR-S STR-NE STR-O STR-A STR-NU)
ESIM 77.38±0.32 77.03±0.18 88.34±0.24 78.49±1.00 49.89±0.15 75.03±0.40 74.21±0.24 69.30±2.38 51.61±1.13 57.95±1.47 53.21±2.04 21.02±1.00
ESIM+ELMo 79.83±0.11 79.85±0.21 88.81±0.17 83.24±1.33 50.07±0.27 76.30±0.45 76.29±0.33 74.03±0.25 52.80±0.79 58.42±1.63 54.41±1.69 20.95±1.00
BERT-B 84.72±0.24 84.89±0.20 91.24±0.11 95.53±0.38 62.31±1.51 81.30±0.40 81.79±0.34 76.91±0.28 55.37±0.65 59.57±0.90 64.96±0.89 39.02±3.76
RoBERTa-B 87.64±0.12 87.66±0.17 91.94±0.07 97.04±0.36 72.45±1.02 82.44±0.30 85.13±0.15 81.97±0.27 57.39±0.63 63.38±0.98 73.84±1.61 52.80±3.39
XLNet-B 86.78±0.28 86.42±0.14 91.54±0.11 95.95±0.63 66.29±1.08 81.35±0.37 84.40±0.17 80.33±0.28 57.18±0.56 63.70±2.04 75.70±1.48 40.32±4.31
BERT-L 86.62±0.17 86.75±0.19 92.09±0.09 95.71±0.53 72.42±1.78 82.26±0.40 84.20±0.22 79.32±0.48 62.25±1.55 64.48±1.71 72.28±1.01 49.56±4.20
RoBERTa-L 90.04±0.17 89.99±0.15 93.09±0.12 97.50±0.19 75.90±0.99 84.42±0.30 87.68±0.19 85.67±0.22 60.03±2.04 63.10±1.71 78.96±1.91 61.27±5.25
XLNet-L 89.48±0.20 89.31±0.18 92.90±0.14 97.57±0.23 75.75±1.22 83.55±0.30 87.33±0.18 84.30±0.32 60.46±3.25 67.47±2.37 84.26±2.14 62.14±3.63
Table 9: Means and standard deviations (mean±std) of final performance on NLI datasets for all models. Columns are assumed to follow the dataset order of Table 8.
Model Standard Dataset (SQuAD) Analysis Sets (AddSent AddOneSent)
BERT-B 87.16±0.13 63.70±0.57 72.33±0.48
XLNet-B 89.33±0.39 69.19±1.18 77.20±0.94
Table 10: Means and standard deviations (mean±std) of final F1 on SQuAD dev set for both BERT-B and XLNet-B.

Appendix B Details of NLI Analysis Datasets

We used the following NLI analysis datasets in our experiments: Break NLI Glockner et al. (2018), SNLI-hard Gururangan et al. (2018), NLI Stress Test Naik et al. (2018) and HANS McCoy et al. (2019b).

Break NLI.

The examples in Break NLI resemble the examples in SNLI. The hypothesis is generated by swapping words in the premise so that lexical or world knowledge is required to make the correct prediction.


SNLI-hard.

The SNLI-hard dataset is a subset of the SNLI test set. Examples that can be predicted correctly by looking only at the annotation artifacts in the hypothesis sentence are removed.

NLI Stress.

The NLI Stress datasets are a collection of datasets modified from MNLI. Each dataset targets one specific linguistic phenomenon, including word overlap, negation, antonyms, numerical reasoning, length mismatch, and spelling errors. Models with certain weaknesses will get low performance on the corresponding dataset.


HANS.

The examples in HANS are created to reveal three heuristics used by models: the lexical overlap heuristic, the sub-sequence heuristic, and the constituent heuristic. For each heuristic, examples are generated using 5 different templates.

Dataset statistics and categories for all the NLI datasets can be seen in Table 8.

Appendix C Means and Standard Deviations of Final Results on NLI/RC datasets

Here we provide the mean and standard deviation of the final performance over 10 different seeds, for the NLI datasets in Table 9 and for the RC datasets in Table 10.

References


  • M. Bloodgood and J. Grothendieck (2013) Analysis of stopping active learning based on stabilizing predictions. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pp. 10–19. Cited by: §2.
  • S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning (2015) A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 632–642. Cited by: §3.1, §3.2.
  • V. I. S. Carmona, J. Mitchell, and S. Riedel (2018) Behavior analysis of nli models: uncovering the influence of three factors on robustness. NAACL. Cited by: §2.
  • Q. Chen, X. Zhu, Z. Ling, S. Wei, H. Jiang, and D. Inkpen (2017) Enhanced lstm for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1657–1668. Cited by: §A.2, §3.2.
  • C. Clark, M. Yatskar, and L. Zettlemoyer (2019) Don’t take the easy way out: ensemble based methods for avoiding known dataset biases. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 4060–4073. Cited by: §2.
  • Z. Dai, Z. Yang, Y. Yang, J. G. Carbonell, Q. Le, and R. Salakhutdinov (2019) Transformer-xl: attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2978–2988. Cited by: §A.1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Cited by: §A.1, §1, §3.2.
  • M. Gardner, J. Grus, M. Neumann, O. Tafjord, P. Dasigi, N. F. Liu, M. Peters, M. Schmitz, and L. Zettlemoyer (2018) AllenNLP: a deep semantic natural language processing platform. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), pp. 1–6. Cited by: footnote 3.
  • A. Geiger, I. Cases, L. Karttunen, and C. Potts (2019) Posing fair generalization tasks for natural language inference. EMNLP. Cited by: §2.
  • M. Glockner, V. Shwartz, and Y. Goldberg (2018) Breaking nli systems with sentences that require simple lexical inferences. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 650–655. Cited by: Appendix B, §2, §3.1.
  • Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh (2017) Making the v in vqa matter: elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6904–6913. Cited by: §2.
  • S. Gururangan, S. Swayamdipta, O. Levy, R. Schwartz, S. Bowman, and N. A. Smith (2018) Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 107–112. Cited by: Appendix B, §1, §2, §3.1.
  • H. He, S. Zha, and H. Wang (2019) Unlearn dataset bias in natural language inference by fitting the residual. In Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019), pp. 132–142. Cited by: §2.
  • A. Irpan (2018) Deep reinforcement learning doesn’t work yet. Cited by: §2.
  • A. Jabri, A. Joulin, and L. Van Der Maaten (2016) Revisiting visual question answering baselines. In European conference on computer vision, pp. 727–739. Cited by: §2.
  • R. Jia and P. Liang (2017) Adversarial examples for evaluating reading comprehension systems. EMNLP. Cited by: §1, §2, §3.1.
  • M. Lewis and A. Fan (2019) Generative question answering: learning to answer the whole question. In ICLR, Cited by: §2.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §A.1, §1, §3.2.
  • M. C. Makel, J. A. Plucker, and B. Hegarty (2012) Replications in psychology research: how often do they really occur?. Perspectives on Psychological Science 7 (6), pp. 537–542. Cited by: §1.
  • R. T. McCoy, J. Min, and T. Linzen (2019a) BERTs of a feather do not generalize together: large variability in generalization across models with similar test set performance. arXiv preprint arXiv:1911.02969. Cited by: §1.
  • T. McCoy, E. Pavlick, and T. Linzen (2019b) Right for the wrong reasons: diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3428–3448. Cited by: Appendix B, Figure 1, §1, §2, §3.1.
  • P. Minervini and S. Riedel (2018) Adversarially regularising neural nli models to integrate logical background knowledge. CoNLL. Cited by: §2.
  • N. S. Moosavi, P. A. Utama, A. Rücklé, and I. Gurevych (2019) Improving generalization by incorporating coverage in natural language inference. arXiv preprint arXiv:1909.08940. Cited by: §2.
  • A. Naik, A. Ravichander, N. Sadeh, C. Rose, and G. Neubig (2018) Stress test evaluation for natural language inference. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, New Mexico, USA, pp. 2340–2353. Cited by: Appendix B, Figure 1, §3.1.
  • Y. Nie, Y. Wang, and M. Bansal (2019) Analyzing compositionality-sensitivity of nli models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 6867–6874. Cited by: §2.
  • D. Pang, L. H. Lin, and N. A. Smith (2019) Improving natural language inference with a pretrained parser. arXiv preprint arXiv:1909.08217. Cited by: §2.
  • M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In NAACL, Cited by: §1, §3.2.
  • A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf. Cited by: §1.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI Blog 1 (8). Cited by: §1.
  • P. Rajpurkar, R. Jia, and P. Liang (2018) Know what you don’t know: unanswerable questions for squad. ACL. Cited by: §1.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 2383–2392. Cited by: §1, §3.1, §3.2.
  • V. Shwartz and I. Dagan (2018) Paraphrase to explicate: revealing implicit noun-compound relations. ACL. Cited by: §2.
  • M. Tsuchiya (2018) Performance impact caused by hidden bias of training data for recognizing textual entailment. LREC. Cited by: §2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §A.1.
  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2019) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In Proceedings of ICLR, Cited by: §1, §2.
  • A. Williams, N. Nangia, and S. Bowman (2018) A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1112–1122. Cited by: §3.1, §3.2.
  • Y. Yaghoobzadeh, R. Tachet, T. Hazen, and A. Sordoni (2019) Robust natural language inference models with example forgetting. arXiv preprint arXiv:1911.03861. Cited by: §2.
  • Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. NeurIPS. Cited by: §A.1, §3.2.
  • Y. Yeh and Y. Chen (2019) QAInfomax: learning robust question answering system by mutual information maximization. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3361–3366. Cited by: §2.