The MultiBERTs: BERT Reproductions for Robustness Analysis

Thibault Sellam et al.

Experiments with pretrained models such as BERT are often based on a single checkpoint. While the conclusions drawn apply to the artifact (i.e., the particular instance of the model), it is not always clear whether they hold for the more general procedure (which includes the model architecture, training data, initialization scheme, and loss function). Recent work has shown that re-running pretraining can lead to substantially different conclusions about performance, suggesting that alternative evaluations are needed to make principled statements about procedures. To address this issue, we introduce MultiBERTs: a set of 25 BERT-base checkpoints, trained with hyper-parameters similar to those of the original BERT model but differing in random initialization and data shuffling. The aim is to enable researchers to draw robust and statistically justified conclusions about pretraining procedures. The full release includes 25 fully trained checkpoints, as well as statistical guidelines and a code library implementing our recommended hypothesis testing methods. Finally, for five of these models we release a set of 28 intermediate checkpoints in order to support research on learning dynamics.



1 Introduction

Contemporary natural language processing (NLP) relies heavily on pretrained language models, which are trained using large-scale unlabeled data.

Bert (Devlin et al., 2019) is a particularly popular choice: it has been widely adopted in academia and industry, and aspects of its performance have been reported in thousands of research papers (see e.g., Rogers et al., 2020, for an overview). Because pretraining large language models is computationally expensive (Strubell et al., 2019), the accessibility of this line of research has been greatly facilitated by the release of model checkpoints through libraries such as HuggingFace Transformers (Wolf et al., 2020), which enable researchers to build on large-scale language models without reproducing the work of pretraining. Consequently, most published results are based on a small number of publicly released model checkpoints.

While this reuse of model checkpoints has lowered the cost of research and facilitated head-to-head comparisons, it limits our ability to draw general scientific conclusions about the performance of this class of models (Dror et al., 2019; D’Amour et al., 2020; Zhong et al., 2021). The key issue is that reusing model checkpoints makes it hard to generalize observations about the behavior of a single model artifact to statements about the underlying pretraining procedure which generated it. Pretraining such models is an inherently stochastic process which depends on the initialization of the parameters and the ordering of training examples. In fact, D’Amour et al. (2020) report substantial quantitative differences across multiple checkpoints of the same model architecture on several “stress tests” (Naik et al., 2018; McCoy et al., 2019). It is therefore difficult to know how much of the success of a model based on the original Bert checkpoint is due to Bert’s design, and how much is due to idiosyncrasies of a particular run. Understanding this difference is critical if we are to generate reusable insights about deep learning for NLP, and improve the state-of-the-art going forward (Zhou et al., 2020; Dodge et al., 2020).

This paper describes MultiBerts, an effort to facilitate more robust research on the Bert model. Our primary contributions are:

  • We release MultiBerts, a set of 25 Bert checkpoints to facilitate studies of robustness to parameter initialization. The release also includes an additional 140 intermediate checkpoints, captured during training for 5 of these runs (28 checkpoints per run), to facilitate studies of learning dynamics. Releasing these models preserves the benefits of a single checkpoint release (low cost of experiments, apples-to-apples comparisons between studies based on these checkpoints), while enabling researchers to draw more general conclusions about the Bert pretraining procedure (§2).

  • We provide recommendations for how to report results with MultiBerts and present the Multi-Bootstrap, a non-parametric method to quantify the uncertainty of experimental results based on multiple pretraining seeds. To help researchers follow these recommendations, we release a software implementation of the procedure (§4).

  • We document several challenges with reproducing the behavior of the widely-used original Bert release (Devlin et al., 2019). These idiosyncrasies underscore the importance of reproducibility analyses and of distinguishing conclusions about training procedures from conclusions about particular artifacts (§5).

Our checkpoints and statistics libraries are available at:

Figure 1: Distribution of the performance on GLUE dev sets, averaged across finetuning runs for each checkpoint. The dashed line indicates the performance of the original Bert release.
Figure 2: Distribution of the performance on the dev sets of SQuAD v1.1 and v2.0.

2 Release Description


All the checkpoints are trained following the code and procedure of Devlin et al. (2019), with minor hyperparameter modifications necessary to obtain comparable results on GLUE (Wang et al., 2019); see the detailed discussion in §5. We use the Bert-base architecture, with 12 layers and embedding size 768. The model is trained on the masked language modeling (MLM) and next sentence prediction (NSP) objectives. The MLM objective maximizes the probability of predicting randomly masked tokens in an input passage. The NSP objective maximizes the probability of predicting the next “sentence” (text segment) given the current one, using only the words in the sentences as features. Bert is trained on a combination of BooksCorpus (Zhu et al., 2015) and English Wikipedia. Since the exact dataset used to train the original Bert is not available, we used a more recent version collected by Turc et al. (2019) with the same methodology.


We release 25 models trained for two million steps each. For five of these models, we release 28 additional checkpoints captured over the course of pretraining (every 20,000 training steps up to 200,000, then every 100,000 steps; each step involves a batch of 256 sequences). In total, we release 165 checkpoints, about 68 GB.

Training Details.

As in the original Bert paper, we used batch size 256 and the Adam optimizer (Kingma and Ba, 2014) with learning rate 1e-4 and 10,000 warm-up steps. We used the default values for all the other parameters, except the number of steps and the sequence length, which we set to 512 from the beginning, with 80 predictions per sequence.[1] As we were not able to reproduce the original Bert exactly using either 1M or 2M steps (see Section 5 for discussion), we release MultiBerts trained with 2M steps under the assumption that higher-performing models are more interesting objects of study. The Bert code initializes the layers with the truncated Normal distribution, using mean 0.0 and standard deviation 0.02. We train using the same configuration as Devlin et al. (2019), with each run taking about 4.5 days on 16 Cloud TPU v2 chips.

[1] Specifically, we pretrain for 2M steps and keep the sequence length constant (the original paper uses 128 tokens for 90% of the training, then 512 for the remaining 10%) in order to expose the model to more tokens and to simplify the implementation.

Environmental Statement.

We estimate compute costs at around 1728 TPU-hours for each pretraining checkpoint, and around 208 GPU-hours plus 8 TPU-hours for associated fine-tuning experiments (including hyperparameter search and 5x replication). Using the calculations of Lacoste et al. (2019), we estimate this as about 250 kg CO2e for each of our 25 models. Counting the additional experiments of §5, this gives a total of about 6.2 tons CO2e before accounting for offsets or clean energy. Patterson et al. (2021) report that Google Iowa (us-central1) runs on 78% carbon-free energy, and so we estimate that reproducing these experiments in the public Cloud environment (ours were run in a Google data center with similar or lower carbon emissions) would emit closer to 1.4 tons CO2e, or slightly more than one passenger taking a round-trip flight between San Francisco and New York.
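The figures above reconcile with simple arithmetic. A minimal sketch of the back-of-envelope calculation (all inputs are the paper's estimates, not measurements):

```python
def total_emissions_tons(kg_per_model: float, n_models: float) -> float:
    """Total CO2e in metric tons for n pretraining runs."""
    return kg_per_model * n_models / 1000.0

per_model_kg = 250.0  # ~250 kg CO2e per pretraining run (estimate from the text)
gross_tons = total_emissions_tons(per_model_kg, 25)  # 25 models -> ~6.25 t

# With 78% carbon-free energy, only ~22% of the gross figure is emitted.
carbon_free_fraction = 0.78
net_tons = gross_tons * (1.0 - carbon_free_fraction)  # ~1.4 t

print(gross_tons, net_tons)
```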

By releasing the trained checkpoints publicly, we aim to enable many research efforts on reproducibility and robustness without requiring this cost to be incurred for every subsequent study.

3 Performance Benchmarks

GLUE Setup.

We report results on the development sets of CoLA (Warstadt et al., 2018), MNLI (matched; Williams et al., 2018), MRPC (Dolan and Brockett, 2005), QNLI (v2; Rajpurkar et al., 2016; Wang et al., 2019), QQP (Chen et al., 2018), RTE (Bentivogli et al., 2009), SST-2 (Socher et al., 2013), and STS-B (Cer et al., 2017), using the same modeling and training approach as Devlin et al. (2019). For each task, we fine-tune Bert for 3 epochs using a batch size of 32. We run a parameter sweep over learning rates [5e-5, 4e-5, 3e-5, 2e-5] and report the best score. We repeat the procedure five times and average the results.

SQuAD Setup.

We report results on the development sets of SQuAD versions 1.1 and 2.0, using a setup similar to that of Devlin et al. (2019). For both sets of experiments, we use batch size 48, learning rate 5e-5, and train for 2 epochs.


Figures 1 and 2 show the distribution of the MultiBerts checkpoints’ performance on the development sets of GLUE (Wang et al., 2019) and SQuAD (Rajpurkar et al., 2016), in comparison to the performance of the original Bert checkpoint. On most tasks, the original Bert’s performance falls within the same range as MultiBerts (i.e., between the minimum and maximum of the MultiBerts scores). The original Bert outperforms MultiBerts on QQP and underperforms on SQuAD. These discrepancies may be explained both by randomness and by differences in training setups, as explored further in Section 5.

Instance-Level Agreement.

Table 1 shows per-example agreement rates on GLUE predictions between pairs of models pretrained with a single seed (“same”) and pairs pretrained with different seeds (“diff”); in all cases, models are fine-tuned with different seeds. With the exception of RTE, we see high agreement (over 90%) on test examples drawn from the same distribution as the training data, and note that agreement is 1-2% lower on average when comparing predictions of models pretrained with different seeds, compared to models pretrained with the same seed. However, this discrepancy becomes significantly more pronounced on out-of-domain “challenge sets”, which feature a different data distribution from the training set. That is, evaluating our MNLI models on the anti-stereotypical examples from HANS (McCoy et al., 2019), we see agreement drop from 88% to 82% when comparing across pretraining seeds. Figure 3 shows how this can affect overall accuracy, which can vary over a range of nearly 20% depending on the pretraining seed. Such results underscore the need to evaluate multiple pretraining runs, especially when evaluating a model’s ability to generalize outside of its training distribution.

Task Same Diff. Same - Diff.
CoLA 91.5% 89.7% 1.7%
MNLI 93.6% 90.1% 3.5%
 HANS (all) 92.2% 88.1% 4.1%
 HANS (neg) 88.3% 81.9% 6.4%
MRPC 91.7% 90.4% 1.3%
QNLI 95.0% 93.2% 1.9%
QQP 95.0% 94.1% 0.9%
RTE 74.3% 73.0% 1.3%
SST-2 97.1% 95.6% 1.4%
STS-B 97.6% 96.2% 1.4%
Table 1: Average per-example agreement between model predictions on each task. This is computed as the average “accuracy” between the predictions of two runs for classification tasks, or Pearson correlation for regression (STS-B). We separate pairs of models that use the same pretraining seed but different finetuning seeds (Same) and pairs that differ both in their pretraining and finetuning seeds (Diff). HANS (neg) refers to only the anti-stereotypical examples (non-entailment), which exhibit significant variability between models (McCoy et al., 2020).
Figure 3: Accuracy of MNLI models on the anti-stereotypical (non-entailment) examples from HANS (McCoy et al., 2020), grouped by pretraining seed. Each column shows the distribution of five fine-tuning runs based on the same initial checkpoint.
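The agreement statistic reported in Table 1 is straightforward to compute from two runs' predictions. A minimal sketch with toy data (not actual MultiBerts outputs; NumPy is assumed available):

```python
import numpy as np

def agreement(preds_a, preds_b, regression=False):
    """Per-example agreement between two runs: fraction of matching labels
    for classification tasks, Pearson correlation for regression (as for STS-B)."""
    a, b = np.asarray(preds_a), np.asarray(preds_b)
    if regression:
        return float(np.corrcoef(a, b)[0, 1])
    return float(np.mean(a == b))

# Toy example: two fine-tuning runs agreeing on 4 of 5 examples.
run1 = [1, 0, 1, 1, 0]
run2 = [1, 0, 0, 1, 0]
print(agreement(run1, run2))  # 0.8
```

Averaging this quantity over all same-seed (or different-seed) pairs gives the corresponding entries of Table 1.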

4 Hypothesis Testing Using Multiple Checkpoints

The previous section compared MultiBerts with the original Bert, finding broadly similar performance along with some differences, such as on SQuAD. But to what extent can these results be explained by random noise? More generally, how can we quantify the uncertainty of a set of experimental results? A primary goal of MultiBerts is to enable more principled and standardized methods to compare training procedures. To this end, we recommend a non-parametric bootstrapping procedure, which we refer to as the “Multi-Bootstrap”, described below and implemented as a library function alongside the MultiBerts release. The procedure enables us to make inferences about model performance in the face of multiple sources of randomness, including randomness due to the pretraining seed, the fine-tuning seed, and the finite test data, by using the average behavior over seeds as a means of summarizing expected behavior in an ideal world with infinite samples.

4.1 Interpreting Statistical Results

The advantage of using the Multi-Bootstrap is that it provides an interpretable summary of the amount of remaining uncertainty when summarizing the performance over multiple seeds. The following notation will help us state this precisely. We assume access to model predictions for each instance in the evaluation set. We consider randomness arising from:

  1. The choice of pretraining seed

  2. The choice of finetuning seed

  3. The choice of test sample

The Multi-Bootstrap procedure allows us to account for all of the above. The contribution of MultiBerts is that it enables us to estimate (1), the variance due to the pretraining seed, which is not possible given only a single artifact. Note that multiple finetuning runs are not required in order to use the Multi-Bootstrap.

For each pretraining seed $s$, let $f_s$ denote the learned model's prediction on input features $x$, and let $L(f_s) = \mathbb{E}_{(x,y) \sim \mathcal{D}}[\ell(y, f_s(x))]$ denote the expected performance metric of $f_s$ on a test distribution $\mathcal{D}$ over features $x$ and labels $y$. For example, for accuracy we would take $\ell(y, f_s(x)) = \mathbf{1}\{y = f_s(x)\}$. We can use the test sample to estimate the performance for each of the $n$ seeds in MultiBerts, which we denote as $\hat{L}(f_s)$.

The performance $L(f_s)$ depends on the seed, but we are interested in summarizing the model over all seeds. A natural summary is the average over seeds, $\mathbb{E}_s[L(f_s)]$. We will denote this by $\theta$, using the Greek letter to emphasize that it is an unknown quantity that we wish to estimate. Then, we can compute an estimate as $\hat{\theta} = \frac{1}{n} \sum_{s=1}^{n} \hat{L}(f_s)$.

Because $\hat{\theta}$ is computed from a finite evaluation set and a finite number of seeds, it is necessary to quantify the uncertainty of the estimate. The goal of the Multi-Bootstrap is to estimate the distribution of the error in this estimate, $\hat{\theta} - \theta$. With this, we can compute confidence intervals for $\theta$ and test hypotheses about $\theta$, such as whether it is above zero or another threshold of interest.
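The point estimate described above (per-seed empirical performance, averaged over seeds) can be sketched in a few lines. This is an illustration with toy values, assuming accuracy as the metric and NumPy available, not the released library's API:

```python
import numpy as np

def theta_hat(preds_per_seed, labels):
    """Average over seeds of each seed's empirical accuracy on the test set."""
    labels = np.asarray(labels)
    per_seed = [np.mean(np.asarray(p) == labels) for p in preds_per_seed]
    return float(np.mean(per_seed))

# Toy data: two seeds, three test examples (per-seed accuracies 1.0 and 2/3).
labels = [1, 0, 1]
preds = [[1, 0, 1], [1, 0, 0]]
print(theta_hat(preds, labels))  # (1.0 + 2/3) / 2, roughly 0.833
```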

Below, we summarize a few common experimental designs that can be studied using this distribution.

Comparison to a Fixed Baseline.

We might seek to summarize the performance or behavior of a proposed model. Examples include:

  • Does Bert encode information about syntax (e.g., as compared to feature-engineered models)? (Tenney et al., 2019; Hewitt and Manning, 2019)

  • Does Bert encode social stereotypes (e.g., as compared to human biases)? (Nadeem et al., 2020)

  • Does Bert encode world knowledge (e.g., as compared to explicit knowledge bases)? (Petroni et al., 2019)

  • Does another model such as RoBERTa (Liu et al., 2019) outperform Bert on tasks like GLUE and SQuAD?

In some of these cases, we might compare against some exogenously-defined baseline of which we only have a single estimate (e.g., random or human performance) or against an existing model that is not derived from the MultiBERT checkpoints. In this case, we treat the baseline as fixed, and jointly bootstrap over seeds and examples in order to estimate variation in the MultiBerts and test data.

Paired samples.

Alternatively, we might seek to assess the effectiveness of a specific intervention on model behavior. In such studies, an intervention is proposed (e.g., representation learning via a specific intermediate task, or a specific architecture change) which can be applied to any pretrained Bert checkpoint. The question is whether such an intervention results in an improvement over the original Bert pretraining procedure. That is, does the intervention reliably produce the desired effect, or is the observed effect due to the idiosyncrasies of a particular model artifact? Examples of such studies include:

  • Does intermediate tuning on NLI after pretraining make models more robust across language understanding tasks (Phang et al., 2018)?

  • Does pruning attention heads degrade model performance on downstream tasks (Voita et al., 2019)?

  • Does augmenting Bert with information about semantic roles improve performance on benchmark tasks (Zhang et al., 2020)?

We will refer to studies like the above as paired, since each instance of the baseline model (which does not receive the intervention) can be paired with an instance of the proposed model (which receives the stated intervention) such that the two are based on the same pretrained checkpoint, produced using the same seed. Denoting $\theta_b$ and $\theta_i$ as the expected performance defined above for the baseline and intervention models respectively, our goal is to understand the difference between the estimates $\hat{\theta}_i$ and $\hat{\theta}_b$.

In a paired study, the Multi-Bootstrap allows us to estimate both of the errors $\hat{\theta}_b - \theta_b$ and $\hat{\theta}_i - \theta_i$, as well as the correlation between the two. Together, these allow us to estimate the overall error $(\hat{\theta}_i - \hat{\theta}_b) - (\theta_i - \theta_b)$.

Unpaired samples.

Finally, we might seek to compare a number of seeds in both the intervention and baseline models, but may not expect them to be aligned in their dependence on the seed. For example, the second model may be a different architecture, so that the two do not share checkpoints, or they may be generated from entirely separate initialization schemes. We refer to such studies as unpaired. As in a paired study, the Multi-Bootstrap allows us to estimate the errors $\hat{\theta}_b - \theta_b$ and $\hat{\theta}_i - \theta_i$; however, in an unpaired study, we cannot estimate the correlation between the errors. Thus, we assume that the correlation is zero. This gives a conservative estimate of the error as long as the two errors are not negatively correlated. There is little reason to believe that the random seeds used for two different models would induce a negative correlation between the models’ performance, so this assumption is relatively safe.

Hypothesis Testing.

With the measured uncertainty, we recommend testing whether or not the difference $\delta$ is meaningfully different from some arbitrary predefined threshold (i.e., whether $\delta > 0$ in the typical case). Specifically, we are often interested in rejecting, in a statistically rigorous way, the null hypothesis that the intervention does not improve over the baseline model, i.e.,

$$H_0: \delta \leq 0 \qquad (1)$$

This can be done using the Multi-Bootstrap procedure described below.

4.2 Multi-Bootstrap Procedure

The Multi-Bootstrap procedure is a non-parametric bootstrapping procedure that allows us to estimate the distribution of the error over the seeds and test instances. Our Multi-Bootstrap procedure supports both paired and unpaired study designs, differentiating the two settings only in the way the sampling is performed.

To keep the presentation simple, we will assume that the performance $L(f)$ is an average over a per-example metric $\ell(x, y, f)$, i.e., $L(f) = \mathbb{E}_{(x,y) \sim \mathcal{D}}[\ell(x, y, f)]$, and that $\hat{L}(f)$ is the corresponding empirical average over the $m$ observed test examples, $\hat{L}(f) = \frac{1}{m} \sum_{j=1}^{m} \ell(x_j, y_j, f)$. Our discussion generalizes to any performance metric which behaves asymptotically like an average, including accuracy, AUC, BLEU score, and expected calibration error.

While there is a rich literature on bootstrap methods (e.g., Efron and Tibshirani, 1994), the Multi-Bootstrap is a new bootstrap method for handling the structure of the way that randomness from the seeds and the test set creates error in the estimate $\hat{\theta}$. The statistical underpinnings of this approach share theoretical and methodological connections to inference procedures for two-sample tests (Van der Vaart, 2000), where the samples from each population are independent. However, in those settings, the test statistics naturally differ as a result of the scientific question at hand.

In this procedure, we generate a bootstrap sample from the full sample with replacement, separately over both the randomness from the pretraining seeds and from the test set. That is, we generate a sample of pretraining seeds $s_1^*, \ldots, s_n^*$, each drawn randomly with replacement from the $n$ pretraining seeds, and we generate a test set sample $(x_1^*, y_1^*), \ldots, (x_m^*, y_m^*)$, each pair drawn randomly with replacement from the full test set. Then, we compute the bootstrap estimate as $\hat{\theta}^* = \frac{1}{n} \sum_{k=1}^{n} \hat{L}^*(f_{s_k^*})$, where $\hat{L}^*(f) = \frac{1}{m} \sum_{j=1}^{m} \ell(x_j^*, y_j^*, f)$.

It turns out that when $n$ and $m$ are large enough, the distribution of the estimation error $\hat{\theta} - \theta$ is approximated well by the distribution of $\hat{\theta}^* - \hat{\theta}$ over re-draws of the bootstrap samples.
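The resampling scheme just described can be sketched compactly. This is an illustrative implementation, not the released library's API; it assumes per-example scores are precomputed in an (n_seeds × n_examples) NumPy array:

```python
import numpy as np

def multibootstrap(scores, n_boot=1000, rng=None):
    """Minimal Multi-Bootstrap sketch.

    `scores[s, j]` is the per-example metric for seed s on test example j.
    Each bootstrap replicate resamples seeds and test examples independently,
    with replacement, and records the resulting estimate of theta.
    """
    rng = np.random.default_rng(rng)
    scores = np.asarray(scores, dtype=float)
    n_seeds, n_examples = scores.shape
    thetas = np.empty(n_boot)
    for b in range(n_boot):
        seed_idx = rng.integers(0, n_seeds, size=n_seeds)
        ex_idx = rng.integers(0, n_examples, size=n_examples)
        # np.ix_ selects the resampled (seeds x examples) submatrix.
        thetas[b] = scores[np.ix_(seed_idx, ex_idx)].mean()
    return thetas

# Toy per-example accuracies for 3 seeds on 4 test examples.
scores = np.array([[1, 0, 1, 1], [1, 1, 0, 1], [0, 1, 1, 1]])
boot = multibootstrap(scores, n_boot=2000, rng=0)
print(scores.mean(), boot.mean())  # bootstrap mean stays close to 0.75
```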

For nested sources of randomness (i.e., if for each pretraining seed , we have estimates from multiple finetuning seeds), we average over all of the inner samples (finetuning seeds) in every bootstrap sample, motivated by the recommendations for bootstrapping clustered data recommended by Field and Welsh (2007).

Paired Design.

In a paired design, the Multi-Bootstrap procedure can additionally tell us the joint distribution between $\hat{\theta}_b - \theta_b$ and $\hat{\theta}_i - \theta_i$. To do so, one must use the same bootstrap samples of the seeds and test examples for both models. Then, the correlation between the errors $\hat{\theta}_b - \theta_b$ and $\hat{\theta}_i - \theta_i$ is well approximated by the correlation between the bootstrap errors $\hat{\theta}_b^* - \hat{\theta}_b$ and $\hat{\theta}_i^* - \hat{\theta}_i$.

In particular, recall that we defined the difference in performance between the intervention and the baseline to be $\delta = \theta_i - \theta_b$, and defined its estimator to be $\hat{\delta} = \hat{\theta}_i - \hat{\theta}_b$. With the Multi-Bootstrap, we can estimate the bootstrapped difference $\hat{\delta}^* = \hat{\theta}_i^* - \hat{\theta}_b^*$. With this, the distribution of the estimation error $\hat{\delta} - \delta$ is well approximated by the distribution of $\hat{\delta}^* - \hat{\delta}$ over bootstrap samples.
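A minimal sketch of the paired variant, under the same simplifying assumptions as before (precomputed per-example scores, aligned by seed; illustrative only). Because the same seed and example indices are reused for both models, variation shared between the two cancels in the difference; in the toy example a constant +0.05 effect is recovered exactly in every replicate:

```python
import numpy as np

def paired_multibootstrap(base_scores, intv_scores, n_boot=1000, rng=None):
    """Paired Multi-Bootstrap sketch: baseline and intervention share
    pretraining seeds, so each replicate reuses the SAME resampled seed
    and example indices for both models, yielding delta estimates."""
    rng = np.random.default_rng(rng)
    b = np.asarray(base_scores, dtype=float)  # shape (n_seeds, n_examples)
    i = np.asarray(intv_scores, dtype=float)  # aligned with b by seed
    n_seeds, n_examples = b.shape
    deltas = np.empty(n_boot)
    for k in range(n_boot):
        s = rng.integers(0, n_seeds, size=n_seeds)
        e = rng.integers(0, n_examples, size=n_examples)
        deltas[k] = i[np.ix_(s, e)].mean() - b[np.ix_(s, e)].mean()
    return deltas

# Toy example: the intervention improves every per-example score by 0.05.
base = np.array([[0.8, 0.6, 0.7], [0.7, 0.6, 0.8]])
intv = base + 0.05
deltas = paired_multibootstrap(base, intv, n_boot=500, rng=0)
print(deltas.mean())  # every replicate is 0.05: pairing cancels shared noise
```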

Unpaired Design.

For studies that do not match the paired format, we adapt the Multi-Bootstrap procedure so that, rather than sampling a single set of pretraining seeds shared between the baseline and the intervention models, we sample pretraining seeds for each independently. The remainder of the algorithm proceeds as in the paired case. Relative to the paired design discussed above, this additionally assumes that the errors due to differences in pretraining seed between the two models are independent.


A valid $p$-value for the hypothesis test described in Equation 1 is the fraction of bootstrap samples from the above procedure for which the bootstrap estimate of the difference is negative.
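This p-value computation is a one-liner over the bootstrap output; a sketch with toy delta values (illustrative, not the released library's function):

```python
import numpy as np

def pvalue(boot_deltas):
    """One-sided p-value for H0 (no improvement): fraction of bootstrap
    replicates in which the estimated difference is negative."""
    boot_deltas = np.asarray(boot_deltas, dtype=float)
    return float(np.mean(boot_deltas < 0.0))

print(pvalue([0.01, 0.02, -0.005, 0.03]))  # 0.25
```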

Handling single seeds for the baseline.

Often, we do not have access to multiple estimates of the baseline's performance, for example when the baseline against which we are comparing is an estimate of human performance for which only one experiment was run, or when it is the performance of a previously-published model for which there only exists a single artifact or for which we do not have direct access to model predictions. When we have only a point estimate for the baseline, we recommend still using the Multi-Bootstrap to compute a confidence interval around the new model's performance, and simply reporting where the given estimate of baseline performance falls within that distribution. An example of this is given in Figure 1, in which the distribution of MultiBerts performance is compared to that from the single checkpoint of the original Bert release. In general, such results should be interpreted conservatively, as we cannot make any claims about the variance of the baseline model.
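Reporting where a fixed baseline falls within the bootstrap distribution reduces to a percentile lookup. A minimal sketch with toy values (illustrative only):

```python
import numpy as np

def baseline_percentile(boot_thetas, baseline_score):
    """Fraction of bootstrap estimates at or below the baseline's point
    estimate, i.e., the baseline's percentile within the distribution."""
    boot_thetas = np.asarray(boot_thetas, dtype=float)
    return float(np.mean(boot_thetas <= baseline_score))

# Toy bootstrap distribution of the new model's performance.
boot = [0.82, 0.84, 0.85, 0.86, 0.88]
print(baseline_percentile(boot, 0.85))  # 0.6, i.e., the 60th percentile
```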

5 Application of Multi-Bootstrap: Reproducing Original Bert

We now discuss challenges in reproducing the performance of the original Bert checkpoint, using the Multi-Bootstrap procedure presented above.

The performance of the original bert-base-uncased checkpoint appears to be an outlier when viewed against the distribution of scores obtained using the MultiBerts reproductions. Specifically, in reproducing the training recipe of Devlin et al. (2019), we found it difficult to simultaneously match performance on all tasks using a single set of hyperparameters. Devlin et al. (2019) report training for 1M steps. However, as shown in Figures 1 and 2, models pretrained for 1M steps matched the original checkpoint on SQuAD but lagged behind on GLUE tasks; if pretraining continues to 2M steps, GLUE performance matches the original checkpoint but SQuAD performance is significantly higher.

The above observations suggest two separate but related hypotheses (below) about the Bert pretraining procedure. In this section, we use the proposed Multi-Bootstrap procedure to test these hypotheses.

  1. On most tasks, running Bert pretraining for 2M steps produces better models than 1M steps. We test this using paired Multi-Bootstrap (§5.1).

  2. The MultiBerts training procedure outperforms the original Bert procedure on SQuAD. We test this using unpaired Multi-Bootstrap (§5.2).

5.1 How many steps to pretrain?

To test our first hypothesis, we use the proposed Multi-Bootstrap, letting the baseline be the predictor induced by the Bert pretraining procedure using the default 1M steps, and letting the intervention be the predictor resulting from training for 2M steps. From a glance at the histograms in Figure 5, we can see that MNLI appears to be a case where 2M steps is generally better, while MRPC and RTE appear less conclusive. The Multi-Bootstrap allows us to test this quantitatively, using samples over both the seeds and the test examples. Results are shown in Table 2. We find that MNLI conclusively performs better with 2M steps ($\hat{\delta} = 0.007$ with $p = 0.001$); for RTE and MRPC we cannot reject the null hypothesis of no difference ($p = 0.141$ and $p = 0.564$, respectively).

	MNLI	RTE	MRPC
$\hat{\theta}_b$ (1M steps)	0.837	0.644	0.861
$\hat{\theta}_i$ (2M steps)	0.844	0.655	0.860
$\hat{\delta}$	0.007	0.011	-0.001
$p$-value ($H_0$: $\delta \leq 0$)	0.001	0.141	0.564
Table 2: Expected scores (accuracy), effect sizes, and $p$-values from the Multi-Bootstrap on selected GLUE tasks. We pre-select the best fine-tuning learning rate by averaging over runs; this is 3e-5 for checkpoints at 1M steps, and 2e-5 for checkpoints at 2M pretraining steps. All tests use 1000 bootstrap samples, in paired mode on the five seeds for which both 1M- and 2M-step checkpoints are available.

As an example of the utility of this procedure, Figure 4 shows the distribution of individual bootstrap samples of the performance estimates for the intervention and the baseline (which we denote $\hat{\theta}_i^*$ and $\hat{\theta}_b^*$, respectively). The distributions overlap significantly, but the samples are highly correlated due to the paired sampling, and we find that individual samples of the difference $\hat{\theta}_i^* - \hat{\theta}_b^*$ are nearly always positive.

Figure 4: Distribution of estimated performance on MNLI across bootstrap samples, for runs with 1M or 2M steps. Individual samples of $\hat{\theta}_b^*$ and $\hat{\theta}_i^*$ are shown on the left, deltas on the right. The bootstrap experiment is run as in Table 2, which gives $\hat{\delta} = 0.007$ with $p = 0.001$.
Figure 5: Distribution of the performance on GLUE dev sets, showing only runs with the best selected learning rate for each task. Each plot shows 25 points (5 finetuning x 5 pretraining) for each of the 1M and 2M-step versions of each of the pretraining runs for which we release intermediate checkpoints (§2).

5.2 Does the MultiBerts procedure outperform original Bert on SQuAD?

To test our second hypothesis, i.e., that the MultiBerts procedure outperforms the original Bert on SQuAD, we must use the unpaired Multi-Bootstrap procedure. In particular, we are limited to the case in which we only have a point estimate for the baseline, because we only have a single measurement of the performance of the original Bert checkpoint. However, the Multi-Bootstrap procedure still allows us to estimate variance across our MultiBerts seeds and across the examples in the evaluation set. On SQuAD 2.0, we find that the MultiBerts models trained for 2M steps outperform the original Bert with a 95% confidence range of 1.9% to 2.9%, allowing us to reject the null hypothesis and corroborating our intuition from Figure 2.

We include notebooks for the above analyses in our code release.

6 Conclusion

To make progress on language model pretraining, it is essential to distinguish between the performance of specific model artifacts and the impact of the training procedures that generate those artifacts. To this end, we have presented two resources: MultiBerts, a set of 25 model checkpoints to support robust research on Bert, and the Multi-Bootstrap, a non-parametric statistical method to estimate the uncertainty of model comparisons across multiple training seeds. We demonstrated the utility of these resources by showing that pretraining for a larger number of steps leads to a significant improvement when fine-tuning on MNLI, but not on two smaller datasets. We hope that the release of multiple checkpoints and the use of principled hypothesis testing will become standard practice in research on pretrained language models.


Acknowledgments

The authors wish to thank Kellie Webster and Ming-Wei Chang for their feedback and suggestions.


References

  • L. Bentivogli, I. Dagan, H. T. Dang, D. Giampiccolo, and B. Magnini (2009) The fifth PASCAL recognizing textual entailment challenge. Cited by: §3.
  • D. Cer, M. Diab, E. Agirre, I. Lopez-Gazpio, and L. Specia (2017) SemEval-2017 task 1: semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Vancouver, Canada, pp. 1–14. External Links: Link, Document Cited by: §3.
  • Z. Chen, H. Zhang, X. Zhang, and L. Zhao (2018) Quora question pairs. University of Waterloo. Cited by: §3.
  • A. D’Amour, K. Heller, D. Moldovan, B. Adlam, B. Alipanahi, A. Beutel, C. Chen, J. Deaton, J. Eisenstein, M. D. Hoffman, et al. (2020) Underspecification presents challenges for credibility in modern machine learning. arXiv preprint arXiv:2011.03395. Cited by: §1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. External Links: Link, Document Cited by: 3rd item, §1, §2, §2, §3, §3, §5.
  • J. Dodge, G. Ilharco, R. Schwartz, A. Farhadi, H. Hajishirzi, and N. Smith (2020) Fine-tuning pretrained language models: weight initializations, data orders, and early stopping. arXiv preprint arXiv:2002.06305. Cited by: §1.
  • W. B. Dolan and C. Brockett (2005) Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005), External Links: Link Cited by: §3.
  • R. Dror, S. Shlomov, and R. Reichart (2019) Deep dominance - how to properly compare deep neural models. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 2773–2785. External Links: Link, Document Cited by: §1.
  • B. Efron and R. J. Tibshirani (1994) An introduction to the bootstrap. CRC Press. Cited by: §4.2.
  • C. A. Field and A. H. Welsh (2007) Bootstrapping clustered data. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 69 (3), pp. 369–390. Cited by: §4.2.
  • J. Hewitt and C. D. Manning (2019) A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4129–4138. External Links: Link, Document Cited by: 1st item.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §2.
  • A. Lacoste, A. Luccioni, V. Schmidt, and T. Dandres (2019) Quantifying the carbon emissions of machine learning. arXiv preprint arXiv:1910.09700. Cited by: §2.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: 4th item.
  • R. T. McCoy, J. Min, and T. Linzen (2020) BERTs of a feather do not generalize together: large variability in generalization across models with similar test set performance. In

    Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

    Online, pp. 217–227. External Links: Link, Document Cited by: Figure 3, Table 1.
  • T. McCoy, E. Pavlick, and T. Linzen (2019)

    Right for the wrong reasons: diagnosing syntactic heuristics in natural language inference

    In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3428–3448. External Links: Link, Document Cited by: §1, §3.
  • M. Nadeem, A. Bethke, and S. Reddy (2020) Stereoset: measuring stereotypical bias in pretrained language models. arXiv preprint arXiv:2004.09456. Cited by: 2nd item.
  • A. Naik, A. Ravichander, N. Sadeh, C. Rose, and G. Neubig (2018) Stress test evaluation for natural language inference. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, New Mexico, USA, pp. 2340–2353. External Links: Link Cited by: §1.
  • D. Patterson, J. Gonzalez, Q. Le, C. Liang, L. Munguia, D. Rothchild, D. So, M. Texier, and J. Dean (2021) Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350. Cited by: §2.
  • F. Petroni, T. Rocktäschel, S. Riedel, P. Lewis, A. Bakhtin, Y. Wu, and A. Miller (2019) Language models as knowledge bases?. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 2463–2473. External Links: Link, Document Cited by: 3rd item.
  • J. Phang, T. Févry, and S. R. Bowman (2018) Sentence encoders on stilts: supplementary training on intermediate labeled-data tasks. arXiv preprint arXiv:1811.01088. Cited by: 1st item.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 2383–2392. External Links: Link, Document Cited by: §3, §3.
  • A. Rogers, O. Kovaleva, and A. Rumshisky (2020) A primer in BERTology: what we know about how BERT works. Transactions of the Association for Computational Linguistics 8, pp. 842–866. External Links: Link, Document Cited by: §1.
  • R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts (2013) Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Washington, USA, pp. 1631–1642. External Links: Link Cited by: §3.
  • E. Strubell, A. Ganesh, and A. McCallum (2019) Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3645–3650. External Links: Link, Document Cited by: §1.
  • I. Tenney, D. Das, and E. Pavlick (2019) BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4593–4601. External Links: Link, Document Cited by: 1st item.
  • I. Turc, M. Chang, K. Lee, and K. Toutanova (2019) Well-read students learn better: on the importance of pre-training compact models. arXiv preprint arXiv:1908.08962. Cited by: §2.
  • A. W. Van der Vaart (2000) Asymptotic statistics. Vol. 3, Cambridge University Press. Cited by: §4.2.
  • E. Voita, D. Talbot, F. Moiseev, R. Sennrich, and I. Titov (2019) Analyzing multi-head self-attention: specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 5797–5808. External Links: Link, Document Cited by: 2nd item.
  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2019) GLUE: a multi-task benchmark and analysis platform for natural language understanding. Note: In the Proceedings of ICLR. Cited by: §2, §3, §3.
  • A. Warstadt, A. Singh, and S. R. Bowman (2018) Neural network acceptability judgments. arXiv preprint 1805.12471. Cited by: §3.
  • A. Williams, N. Nangia, and S. Bowman (2018) A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 1112–1122. External Links: Link, Document Cited by: §3.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush (2020) Transformers: state-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, pp. 38–45. External Links: Link Cited by: §1.
  • Z. Zhang, Y. Wu, H. Zhao, Z. Li, S. Zhang, X. Zhou, and X. Zhou (2020) Semantics-aware bert for language understanding. In

    Proceedings of the AAAI Conference on Artificial Intelligence

    Vol. 34, pp. 9628–9635. Cited by: 3rd item.
  • R. Zhong, D. Ghosh, D. Klein, and J. Steinhardt (2021) Are larger pretrained language models uniformly better? comparing performance at the instance level. arXiv preprint arXiv:2105.06020. Cited by: §1.
  • X. Zhou, Y. Nie, H. Tan, and M. Bansal (2020) The curse of performance instability in analysis datasets: consequences, source, and suggestions. arXiv preprint arXiv:2004.13606. Cited by: §1.
  • Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler (2015) Aligning books and movies: towards story-like visual explanations by watching movies and reading books. In

    2015 IEEE International Conference on Computer Vision (ICCV)

    Vol. , pp. 19–27. External Links: Document, ISSN Cited by: §2.