1 Introduction
Contemporary natural language processing (NLP) relies heavily on pretrained language models, which are trained on large-scale unlabeled data.
Bert (Devlin et al., 2019) is a particularly popular choice: it has been widely adopted in academia and industry, and aspects of its performance have been reported in thousands of research papers (see, e.g., Rogers et al., 2020, for an overview). Because pretraining large language models is computationally expensive (Strubell et al., 2019), the accessibility of this line of research has been greatly facilitated by the release of model checkpoints through libraries such as HuggingFace Transformers (Wolf et al., 2020), which enable researchers to build on large-scale language models without reproducing the work of pretraining. Consequently, most published results are based on a small number of publicly released model checkpoints.
While this reuse of model checkpoints has lowered the cost of research and facilitated head-to-head comparisons, it limits our ability to draw general scientific conclusions about the performance of this class of models (Dror et al., 2019; D'Amour et al., 2020; Zhong et al., 2021). The key issue is that reusing model checkpoints makes it hard to generalize observations about the behavior of a single model artifact to statements about the underlying pretraining procedure that generated it. Pretraining such models is an inherently stochastic process that depends on the initialization of the parameters and the ordering of training examples. In fact, D'Amour et al. (2020) report substantial quantitative differences across multiple checkpoints of the same model architecture on several "stress tests" (Naik et al., 2018; McCoy et al., 2019). It is therefore difficult to know how much of the success of a model based on the original Bert checkpoint is due to Bert's design, and how much is due to idiosyncrasies of a particular run. Understanding this difference is critical if we are to generate reusable insights about deep learning for NLP and improve the state of the art going forward (Zhou et al., 2020; Dodge et al., 2020).
This paper describes MultiBerts, an effort to facilitate more robust research on the Bert model. Our primary contributions are:

We release MultiBerts, a set of 25 Bert checkpoints to facilitate studies of robustness to parameter initialization. The release also includes an additional 140 intermediate checkpoints, captured during training for 5 of these runs (28 checkpoints per run), to facilitate studies of learning dynamics. Releasing these models preserves the benefits of a single checkpoint release (low cost of experiments, apples-to-apples comparisons between studies based on these checkpoints), while enabling researchers to draw more general conclusions about the Bert pretraining procedure (§2).

We provide recommendations on how to report results with MultiBerts and present the MultiBootstrap, a nonparametric method to quantify the uncertainty of experimental results based on multiple pretraining seeds. To help researchers follow these recommendations, we release a software implementation of the procedure (§4).

We document several challenges with reproducing the behavior of the widely used original Bert release (Devlin et al., 2019). These idiosyncrasies underscore the importance of reproducibility analyses and of distinguishing conclusions about training procedures from conclusions about particular artifacts (§5).
Our checkpoints and statistics libraries are available at: http://goo.gle/multiberts.
2 Release Description
Overview.
All the checkpoints are trained following the code and procedure of Devlin et al. (2019), with minor hyperparameter modifications necessary to obtain comparable results on GLUE (Wang et al., 2019); see the detailed discussion in §5. We use the Bert-base architecture, with 12 layers and embedding size 768. The model is trained on the masked language modeling (MLM) and next sentence prediction (NSP) objectives. The MLM objective maximizes the probability of predicting randomly masked tokens in an input passage; the NSP objective maximizes the probability of correctly predicting whether two text segments are consecutive in the source text. The model uses only the words in the input segments as features. Bert is trained on a combination of BooksCorpus (Zhu et al., 2015) and English Wikipedia. Since the exact dataset used to train the original Bert is not available, we used a more recent version collected by Turc et al. (2019) with the same methodology.
Checkpoints.
We release 25 models trained for two million steps each. For five of these models, we release 28 additional checkpoints captured over the course of pretraining (one every 20,000 training steps up to 200,000 steps, then one every 100,000 steps; each step processes a batch of 256 sequences). In total, we release 165 checkpoints, totaling about 68 GB.
Training Details.
As in the original Bert paper, we used batch size 256 and the Adam optimizer (Kingma and Ba, 2014) with learning rate 1e-4 and 10,000 warmup steps. We used the default values for all other hyperparameters, except the number of steps and the sequence length, which we set to 512 from the beginning with 80 predictions per sequence.[1] As we were not able to reproduce the original Bert exactly using either 1M or 2M steps (see Section 5 for discussion), we release MultiBerts trained with 2M steps under the assumption that higher-performing models are more interesting objects of study. The Bert code initializes the layers with the truncated Normal distribution, using mean 0 and standard deviation 0.02. We train using the same configuration as Devlin et al. (2019), with each run taking about 4.5 days on 16 Cloud TPU v2 chips.

[1] Specifically, we pretrain for 2M steps and keep the sequence length constant (the original paper uses 128 tokens for 90% of training, then 512 for the remaining 10%) in order to expose the model to more tokens and simplify the implementation.

Environmental Statement.
We estimate compute costs at around 1728 TPU-hours for each pretraining checkpoint, and around 208 GPU-hours plus 8 TPU-hours for associated fine-tuning experiments (including hyperparameter search and 5x replication). Using the calculations of Lacoste et al. (2019),[2] we estimate this as about 250 kg CO2e for each of our 25 models. Counting the additional experiments of §5, this gives a total of about 6.2 tons CO2e before accounting for offsets or clean energy. Patterson et al. (2021) report that Google Iowa (us-central1) runs on 78% carbon-free energy, so we estimate that reproducing these experiments in the public Cloud environment[3] would emit closer to 1.4t CO2e, slightly more than one passenger taking a round-trip flight between San Francisco and New York. By releasing the trained checkpoints publicly, we aim to enable many research efforts on reproducibility and robustness without requiring this cost to be incurred for every subsequent study.

[2] https://mlco2.github.io/impact/
[3] Experiments were run in a Google data center with similar or lower carbon emissions.
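As a rough sanity check, the totals above follow from simple scaling of the rounded per-model and carbon-free figures quoted in the text:

```python
# Back-of-the-envelope check of the emissions figures quoted above,
# treating the ~250 kg CO2e per run and the 78% carbon-free energy
# figure as given (all inputs are the rounded values from the text).
per_model_kg = 250        # estimated CO2e per pretraining run
num_models = 25

pretraining_tons = per_model_kg * num_models / 1000   # 6.25 t for pretraining alone

carbon_free_fraction = 0.78   # Google Iowa (us-central1), per Patterson et al. (2021)
total_tons = 6.2              # quoted total, including the extra experiments of §5
adjusted_tons = total_tons * (1 - carbon_free_fraction)   # ~1.4 t
```

The ~6.2 t and ~1.4 t figures in the text are consistent with this arithmetic once rounding is taken into account.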
3 Performance Benchmarks
GLUE Setup.
We report results on the development sets of CoLA (Warstadt et al., 2018), MNLI (matched) (Williams et al., 2018), MRPC (Dolan and Brockett, 2005), QNLI (v2) (Rajpurkar et al., 2016; Wang et al., 2019), QQP (Chen et al., 2018), RTE (Bentivogli et al., 2009), SST-2 (Socher et al., 2013), and STS-B (Cer et al., 2017), using the same modeling and training approach as Devlin et al. (2019). For each task, we fine-tune Bert for 3 epochs using a batch size of 32. We run a sweep over the learning rates [5e-5, 4e-5, 3e-5, 2e-5] and report the best score. We repeat the procedure five times and average the results.
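The sweep-then-average protocol above can be sketched as follows. This is an illustration of the evaluation logic only, not the actual training code; `finetune_and_eval` is a hypothetical stand-in for a real fine-tuning run.

```python
# Sketch of the GLUE reporting protocol: for each of five replications,
# sweep the learning rate, keep the best dev score, then average the
# per-replication bests.
import random

LEARNING_RATES = [5e-5, 4e-5, 3e-5, 2e-5]
NUM_REPLICATIONS = 5

def finetune_and_eval(task: str, lr: float, seed: int) -> float:
    # Placeholder: a real run would fine-tune Bert for 3 epochs with batch
    # size 32 and return the dev-set metric. Here we return a dummy score.
    rng = random.Random(hash((task, lr, seed)))
    return 0.80 + 0.05 * rng.random()

def reported_score(task: str) -> float:
    best_per_rep = [
        max(finetune_and_eval(task, lr, seed) for lr in LEARNING_RATES)
        for seed in range(NUM_REPLICATIONS)
    ]
    return sum(best_per_rep) / NUM_REPLICATIONS

score = reported_score("MNLI")
```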
SQuAD Setup.
We report results on the development sets of SQuAD versions 1.1 and 2.0, using a setup similar to that of Devlin et al. (2019). For both sets of experiments, we use batch size 48 and learning rate 5e-5, and train for 2 epochs.
Results.
Figures 1 and 2 show the distribution of the MultiBerts checkpoints' performance on the development sets of GLUE (Wang et al., 2019) and SQuAD (Rajpurkar et al., 2016), compared with the performance of the original Bert checkpoint.[4] On most tasks, the original Bert's performance falls within the range spanned by MultiBerts (i.e., between the minimum and maximum of the MultiBerts scores). The original Bert outperforms MultiBerts on QQP and underperforms on SQuAD. These discrepancies may be explained by both randomness and differences in training setup, as explored further in Section 5.

[4] We used https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-12_H-768_A-12.zip, as linked from https://github.com/google-research/bert.
Instance-Level Agreement.
Table 1 shows per-example agreement rates on GLUE predictions between pairs of models pretrained with the same seed ("Same") and pairs pretrained with different seeds ("Diff."); in all cases, the models are fine-tuned with different seeds. With the exception of RTE, we see high agreement (over 90%) on test examples drawn from the same distribution as the training data, and note that agreement is 1-2% lower on average when comparing predictions of models pretrained with different seeds rather than the same seed. However, this discrepancy becomes significantly more pronounced on out-of-domain "challenge sets" whose data distribution differs from that of the training set. For example, evaluating our MNLI models on the anti-stereotypical examples from HANS (McCoy et al., 2019), we see agreement drop from 88% to 82% when comparing across pretraining seeds. Figure 3 shows how this can affect overall accuracy, which can vary over a range of nearly 20% depending on the pretraining seed. Such results underscore the need to evaluate multiple pretraining runs, especially when assessing a model's ability to generalize outside of its training distribution.
Task | Same | Diff. | Difference
CoLA | 91.5% | 89.7% | 1.7%
MNLI | 93.6% | 90.1% | 3.5%
HANS (all) | 92.2% | 88.1% | 4.1%
HANS (neg) | 88.3% | 81.9% | 6.4%
MRPC | 91.7% | 90.4% | 1.3%
QNLI | 95.0% | 93.2% | 1.9%
QQP | 95.0% | 94.1% | 0.9%
RTE | 74.3% | 73.0% | 1.3%
SST-2 | 97.1% | 95.6% | 1.4%
STS-B | 97.6% | 96.2% | 1.4%

Table 1: Average per-example agreement on GLUE predictions between pairs of models pretrained with the same seed ("Same") or with different seeds ("Diff."); in all cases, models are fine-tuned with different seeds.
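The agreement rates in Table 1 are straightforward to compute from per-example predictions; a minimal sketch:

```python
# Per-example agreement as in Table 1: the fraction of evaluation examples
# on which two models make the same prediction (gold labels are not needed).
def agreement_rate(preds_a, preds_b):
    assert len(preds_a) == len(preds_b), "prediction lists must be aligned"
    matches = sum(1 for a, b in zip(preds_a, preds_b) if a == b)
    return matches / len(preds_a)

# Toy example with hypothetical label ids from two models.
model_1 = [0, 1, 1, 2, 0, 1]
model_2 = [0, 1, 2, 2, 0, 0]
rate = agreement_rate(model_1, model_2)  # 4 of the 6 predictions match
```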
4 Hypothesis Testing Using Multiple Checkpoints
The previous section compared MultiBerts with the original Bert, finding broad similarities but also some differences, such as on SQuAD. To what extent can these results be explained by random noise? More generally, how can we quantify the uncertainty of a set of experimental results? A primary goal of MultiBerts is to enable more principled and standardized methods for comparing training procedures. To this end, we recommend a nonparametric bootstrapping procedure, the "MultiBootstrap," described below and implemented as a library function alongside the MultiBerts release. The procedure enables us to make inferences about model performance in the face of multiple sources of randomness, including the pretraining seed, the fine-tuning seed, and the finite test data, by using the average behavior over seeds to summarize expected behavior in an ideal world with infinite samples.
4.1 Interpreting Statistical Results
The advantage of using the MultiBootstrap is that it provides an interpretable summary of the amount of remaining uncertainty when summarizing the performance over multiple seeds. The following notation will help us state this precisely. We assume access to model predictions for each instance in the evaluation set. We consider randomness arising from:

1. The choice of pretraining seed

2. The choice of fine-tuning seed

3. The choice of test sample

The MultiBootstrap procedure allows us to account for all of the above. The distinctive contribution of MultiBerts is that it enables us to estimate (1), the variance due to the pretraining seed, which is not possible given only a single artifact. Note that multiple fine-tuning runs are not required in order to use the MultiBootstrap.
For each pretraining seed $s$, let $f_s(x)$ denote the learned model's prediction on input features $x$, and let $L(s)$ denote the expected performance metric of $f_s$ on a test distribution $\mathcal{D}$ over features $x$ and labels $y$. For example, for accuracy we would have $L(s) = \mathbb{E}_{(x,y) \sim \mathcal{D}}[\mathbf{1}\{y = f_s(x)\}]$. We can use the test sample to estimate the performance for each of the seeds in MultiBerts, which we denote as $\hat{L}(s)$.
The performance $L(s)$ depends on the seed, but we are interested in summarizing the model over all seeds. A natural summary is the average over seeds, $\theta = \mathbb{E}_s[L(s)]$; we use a Greek letter to emphasize that it is an unknown quantity we wish to estimate. Given $n_s$ seeds, we can compute the estimate

$$\hat{\theta} = \frac{1}{n_s} \sum_{s=1}^{n_s} \hat{L}(s).$$

Because $\hat{\theta}$ is computed from a finite evaluation set and a finite number of seeds, it is necessary to quantify the uncertainty of the estimate. The goal of the MultiBootstrap is to estimate the distribution of the error in this estimate, $\hat{\theta} - \theta$. With this, we can compute confidence intervals for $\theta$ and test hypotheses about $\theta$, such as whether it is above zero or another threshold of interest. Below, we summarize a few common experimental designs that can be studied using this distribution.
Comparison to a Fixed Baseline.
We might seek to summarize the performance or behavior of a proposed model. Examples include:

Does Bert encode social stereotypes (e.g., as compared to human biases)? (Nadeem et al., 2020)

Does Bert encode world knowledge (e.g., as compared to explicit knowledge bases)? (Petroni et al., 2019)

Does another model such as RoBERTa (Liu et al., 2019) outperform Bert on tasks like GLUE and SQuAD?
In some of these cases, we might compare against an exogenously defined baseline of which we have only a single estimate (e.g., random or human performance), or against an existing model that is not derived from the MultiBerts checkpoints. In this case, we treat the baseline as fixed and jointly bootstrap over seeds and examples in order to estimate the variation due to the MultiBerts seeds and the test data.
Paired samples.
Alternatively, we might seek to assess the effectiveness of a specific intervention on model behavior. In such studies, an intervention is proposed (e.g., representation learning via a specific intermediate task, or a specific architecture change) which can be applied to any pretrained Bert checkpoint. The question is whether the intervention results in an improvement over the original Bert pretraining procedure: does it reliably produce the desired effect, or is the observed effect due to the idiosyncrasies of a particular model artifact? Examples of such studies include:

Does intermediate tuning on NLI after pretraining make models more robust across language understanding tasks (Phang et al., 2018)?

Does pruning attention heads degrade model performance on downstream tasks (Voita et al., 2019)?

Does augmenting Bert with information about semantic roles improve performance on benchmark tasks (Zhang et al., 2020)?
We will refer to studies like the above as paired, since each instance of the baseline model $f_s$ (which does not receive the intervention) can be paired with an instance of the proposed model $f'_s$ (which receives the stated intervention) such that $f_s$ and $f'_s$ are based on the same pretrained checkpoint, produced using the same seed $s$. Denoting by $\theta$ and $\theta'$ the expected performance defined above for the baseline and intervention models respectively, our goal is to understand the difference between the estimates $\hat{\theta}$ and $\hat{\theta}'$.
In a paired study, the MultiBootstrap allows us to estimate both of the errors $\hat{\theta} - \theta$ and $\hat{\theta}' - \theta'$, as well as the correlation between the two. Together, these allow us to estimate the overall error $\hat{\delta} - \delta = (\hat{\theta}' - \hat{\theta}) - (\theta' - \theta)$.
Unpaired samples.
Finally, we might seek to compare a number of seeds for both the intervention and baseline models, but without expecting the two to be aligned in their dependence on the seed. For example, the second model may use a different architecture, so that the two do not share checkpoints, or it may be generated from an entirely separate initialization scheme. We refer to such studies as unpaired. As in a paired study, the MultiBootstrap allows us to estimate the errors $\hat{\theta} - \theta$ and $\hat{\theta}' - \theta'$; however, in an unpaired study, we cannot estimate the correlation between the errors, and we therefore assume it is zero. This yields a conservative estimate of the error as long as the two errors are not negatively correlated. There is little reason to believe that the random seeds used for two different models would induce a negative correlation between the models' performance, so this assumption is relatively safe.
Hypothesis Testing.
With the measured uncertainty, we recommend testing whether the difference $\delta = \theta' - \theta$ is meaningfully different from some predefined threshold (typically $0$). Specifically, we are often interested in rejecting, in a statistically rigorous way, the null hypothesis that the intervention does not improve over the baseline model,

$$H_0: \delta \le 0. \qquad (1)$$

This can be done using the MultiBootstrap procedure described below.
4.2 MultiBootstrap Procedure
The MultiBootstrap is a nonparametric bootstrapping procedure that estimates the distribution of the error $\hat{\theta} - \theta$ over both the seeds and the test instances. It supports both paired and unpaired study designs, with the two settings differing only in how the sampling is performed.
To keep the presentation simple, we will assume that the performance $L(s)$ is an expectation of a per-example metric $\ell$ over the test distribution,

$$L(s) = \mathbb{E}_{(x,y) \sim \mathcal{D}}[\ell(f_s(x), y)],$$

and that $\hat{L}(s)$ is the corresponding empirical average over the $n$ observed test examples,

$$\hat{L}(s) = \frac{1}{n} \sum_{i=1}^{n} \ell(f_s(x_i), y_i).$$

Our discussion generalizes to any performance metric that behaves asymptotically like an average, including accuracy, AUC, BLEU score, and expected calibration error.
While there is a rich literature on bootstrap methods (e.g., Efron and Tibshirani, 1994), the MultiBootstrap is a new bootstrap method for handling the structure by which randomness from the seeds and the test set creates error in the estimate $\hat{\theta}$. The statistical underpinnings of this approach share theoretical and methodological connections with inference procedures for two-sample tests (Van der Vaart, 2000), where the samples from each population are independent. However, in those settings, the test statistics naturally differ as a result of the scientific question at hand.
In this procedure, we generate a bootstrap sample from the full sample, with replacement, separately over both the randomness from the pretraining seeds and that from the test set. That is, we generate a sample of pretraining seeds $s_1^*, \ldots, s_{n_s}^*$, each drawn randomly with replacement from the $n_s$ pretraining seeds, and a test set sample $(x_1^*, y_1^*), \ldots, (x_n^*, y_n^*)$, each pair drawn randomly with replacement from the full test set. We then compute the bootstrap estimate as

$$\hat{\theta}^* = \frac{1}{n_s} \sum_{j=1}^{n_s} \frac{1}{n} \sum_{i=1}^{n} \ell(f_{s_j^*}(x_i^*), y_i^*).$$

It turns out that when $n$ and $n_s$ are large enough, the distribution of the estimation error $\hat{\theta} - \theta$ is approximated well by the distribution of $\hat{\theta}^* - \hat{\theta}$ over redraws of the bootstrap samples.
For nested sources of randomness (i.e., if for each pretraining seed $s$ we have estimates from multiple fine-tuning seeds), we average over all of the inner samples (fine-tuning seeds) in every bootstrap sample, following the recommendations of Field and Welsh (2007) for bootstrapping clustered data.
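The resampling scheme above can be sketched in a few lines of code. This is a simplified illustration, not the released library: `scores[s][i]` is assumed to hold the per-example metric for pretraining seed `s` on test example `i`.

```python
# Simplified MultiBootstrap for a single model: resample seeds and test
# examples independently, with replacement, and recompute the estimate.
import random

def multibootstrap(scores, num_replicates=1000, rng=None):
    rng = rng or random.Random(0)
    n_seeds, n_examples = len(scores), len(scores[0])
    estimates = []
    for _ in range(num_replicates):
        seed_sample = [rng.randrange(n_seeds) for _ in range(n_seeds)]
        example_sample = [rng.randrange(n_examples) for _ in range(n_examples)]
        # Bootstrap estimate: average over resampled seeds of the empirical
        # performance on the resampled test set.
        total = sum(scores[s][i] for s in seed_sample for i in example_sample)
        estimates.append(total / (n_seeds * n_examples))
    return estimates

# Toy data: 3 seeds, 4 test examples, 0/1 per-example accuracy.
scores = [[1, 1, 0, 1],
          [1, 0, 0, 1],
          [1, 1, 1, 1]]
theta_hat = sum(map(sum, scores)) / 12   # point estimate: 0.75
boot = multibootstrap(scores, num_replicates=500)
```

The spread of `b - theta_hat` over the replicates in `boot` then approximates the distribution of the estimation error.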
Paired Design.
In a paired design, the MultiBootstrap procedure can additionally tell us the joint distribution of the errors $\hat{\theta} - \theta$ and $\hat{\theta}' - \theta'$. To do so, one must use the same bootstrap samples of the seeds and test examples for both models. Then, the correlation between the two errors is well approximated by the correlation between the bootstrap errors $\hat{\theta}^* - \hat{\theta}$ and $\hat{\theta}'^* - \hat{\theta}'$.
In particular, recall that we defined the difference in performance between the intervention and the baseline to be $\delta = \theta' - \theta$, with estimator $\hat{\delta} = \hat{\theta}' - \hat{\theta}$. With the MultiBootstrap, we can compute the bootstrapped difference

$$\hat{\delta}^* = \hat{\theta}'^* - \hat{\theta}^*.$$

The distribution of the estimation error $\hat{\delta} - \delta$ is then well approximated by the distribution of $\hat{\delta}^* - \hat{\delta}$ over bootstrap samples.
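The paired sampling can be sketched as follows (again a simplified illustration, not the released library): sharing the resampled seeds and examples between the two models is what preserves the correlation between their errors.

```python
# Paired MultiBootstrap sketch: `base[s][i]` and `interv[s][i]` are
# per-example scores for the baseline and intervention models derived
# from the same pretraining seed s.
import random

def _avg(m, seeds, examples):
    # Empirical average over the resampled seeds and test examples.
    return sum(m[s][i] for s in seeds for i in examples) / (len(seeds) * len(examples))

def paired_multibootstrap(base, interv, num_replicates=1000, rng=None):
    rng = rng or random.Random(0)
    n_seeds, n_examples = len(base), len(base[0])
    deltas = []
    for _ in range(num_replicates):
        # The SAME resampled seeds/examples are applied to both models.
        seeds = [rng.randrange(n_seeds) for _ in range(n_seeds)]
        examples = [rng.randrange(n_examples) for _ in range(n_examples)]
        deltas.append(_avg(interv, seeds, examples) - _avg(base, seeds, examples))
    return deltas

# Toy data: 2 seeds, 4 examples; the intervention fixes two errors.
base =   [[0, 1, 0, 1], [1, 0, 1, 0]]
interv = [[1, 1, 0, 1], [1, 1, 1, 0]]
deltas = paired_multibootstrap(base, interv, num_replicates=500)
```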
Unpaired Design.
For studies that do not match the paired format, we adapt the MultiBootstrap procedure so that, rather than sampling a single set of pretraining seeds shared between the two models, we sample pretraining seeds for each model independently. The remainder of the algorithm proceeds as in the paired case. Relative to the paired design discussed above, this additionally assumes that the errors due to differences in pretraining seed between the two models are independent.
p-Values.
A valid p-value for the hypothesis test described in Equation 1 is the fraction of bootstrap samples from the above procedure for which the estimate $\hat{\delta}^*$ is negative.
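Concretely, given the bootstrap samples of the difference, this p-value is a one-line computation (illustration only):

```python
# p-value for the null hypothesis that the intervention does not improve
# over the baseline: the fraction of bootstrap replicates whose estimated
# difference is negative.
def bootstrap_p_value(deltas):
    return sum(1 for d in deltas if d < 0) / len(deltas)

p = bootstrap_p_value([0.02, 0.01, -0.005, 0.03])  # 1 of 4 replicates is negative
```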
Handling single seeds for the baseline.
Often, we do not have access to multiple estimates of the baseline performance $\theta$: for example, when the baseline is an estimate of human performance for which only one experiment was run, or when it is the performance of a previously published model for which only a single artifact exists or whose predictions we cannot access directly. When we have only a point estimate of the baseline, we recommend still using the MultiBootstrap to compute a confidence interval around the MultiBerts estimate, and simply reporting where the given estimate of baseline performance falls within that distribution. An example of this is given in Figure 1, in which the distribution of MultiBerts performance is compared to the single checkpoint of the original Bert release. In general, such results should be interpreted conservatively, as we cannot make any claims about the variance of the baseline model.
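This reporting strategy can be sketched as follows; the bootstrap estimates and the baseline value below are made up for illustration, and the percentile interval is a deliberately crude sketch of a confidence interval:

```python
# With only a point estimate of the baseline, bootstrap the MultiBerts
# estimate, form a percentile interval, and report where the baseline falls.
def percentile_interval(samples, alpha=0.05):
    s = sorted(samples)
    lo = s[int((alpha / 2) * len(s))]
    hi = s[int((1 - alpha / 2) * len(s)) - 1]
    return lo, hi

# Hypothetical bootstrap estimates of the MultiBerts performance.
bootstrap_estimates = [0.836, 0.841, 0.838, 0.845, 0.839, 0.843, 0.840, 0.842]
baseline = 0.837   # hypothetical single-artifact baseline score

lo, hi = percentile_interval(bootstrap_estimates)
inside = lo <= baseline <= hi   # report where the baseline falls
```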
5 Application of MultiBootstrap: Reproducing Original Bert
We now discuss challenges in reproducing the performance of the original Bert checkpoint, using the MultiBootstrap procedure presented above.
The performance of the original bert-base-uncased checkpoint appears to be an outlier when viewed against the distribution of scores obtained using the MultiBerts reproductions. Specifically, in reproducing the training recipe of Devlin et al. (2019), we found it difficult to simultaneously match performance on all tasks using a single set of hyperparameters. Devlin et al. (2019) report training for 1M steps. However, as shown in Figures 1 and 2, models pretrained for 1M steps matched the original checkpoint on SQuAD but lagged behind on GLUE tasks; if pretraining continues to 2M steps, GLUE performance matches the original checkpoint but SQuAD performance is significantly higher.
These observations suggest two separate but related hypotheses about the Bert pretraining procedure, which we test below using the proposed MultiBootstrap procedure.
5.1 How many steps to pretrain?
To test our first hypothesis, we use the proposed MultiBootstrap, letting the baseline be the predictor induced by the Bert pretraining procedure with the default 1M steps, and the intervention be the predictor resulting from training to 2M steps. From a glance at the histograms in Figure 5, MNLI appears to be a case where 2M steps is generally better, while MRPC and RTE appear less conclusive. The MultiBootstrap allows us to test this quantitatively, resampling over both the seeds and the test examples. Results are shown in Table 2. We find that MNLI performance is conclusively better with 2M steps ($\hat{\delta} = 0.007$, $p = 0.001$); for RTE and MRPC we cannot reject the null hypothesis of no difference ($p = 0.141$ and $p = 0.564$, respectively).
| MNLI | RTE | MRPC
$\hat{\theta}$ (1M steps) | 0.837 | 0.644 | 0.861
$\hat{\theta}'$ (2M steps) | 0.844 | 0.655 | 0.860
$\hat{\delta}$ | 0.007 | 0.011 | -0.001
p-value ($H_0$: $\delta \le 0$) | 0.001 | 0.141 | 0.564

Table 2: Expected performance with 1M and 2M pretraining steps, the estimated difference $\hat{\delta}$, and the p-value of the null hypothesis, computed with the paired MultiBootstrap.
As an example of the utility of this procedure, Figure 4 shows the distribution of the individual bootstrap samples of performance for the intervention and the baseline (denoted $\hat{\theta}'^*$ and $\hat{\theta}^*$, respectively). The distributions overlap significantly, but the samples are highly correlated due to the paired sampling, and we find that individual samples of the difference $\hat{\delta}^* = \hat{\theta}'^* - \hat{\theta}^*$ are nearly always positive.
5.2 Does the MultiBerts procedure outperform original Bert on SQuAD?
To test our second hypothesis, i.e., that the MultiBerts procedure outperforms the original Bert on SQuAD, we must use the unpaired MultiBootstrap procedure. In particular, we are limited to the case in which we have only a point estimate of the baseline performance, because we have a single instance of the original Bert checkpoint. However, the MultiBootstrap procedure still allows us to estimate variance across the MultiBerts seeds and across the examples in the evaluation set. On SQuAD 2.0, we find that MultiBerts models trained for 2M steps outperform the original Bert, with a 95% confidence range of 1.9% to 2.9% on the difference; since this range excludes zero, we reject the null hypothesis of no improvement (p < 0.05), corroborating our intuition from Figure 2.
We include notebooks for the above analyses in our code release.
6 Conclusion
To make progress on language model pretraining, it is essential to distinguish between the performance of specific model artifacts and the impact of the training procedures that generate those artifacts. To this end, we have presented two resources: MultiBerts, a set of 25 model checkpoints to support robust research on Bert, and the MultiBootstrap, a nonparametric statistical method to estimate the uncertainty of model comparisons across multiple training seeds. We demonstrated the utility of these resources by showing that pretraining for a larger number of steps leads to a significant improvement when fine-tuning on MNLI, but not on two smaller datasets. We hope that the release of multiple checkpoints and the use of principled hypothesis testing will become standard practices in research on pretrained language models.
Acknowledgments
The authors wish to thank Kellie Webster and MingWei Chang for their feedback and suggestions.
References
Bentivogli et al. (2009). The fifth PASCAL recognizing textual entailment challenge.
Cer et al. (2017). SemEval-2017 task 1: semantic textual similarity multilingual and cross-lingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Vancouver, Canada, pp. 1–14.
Chen et al. (2018). Quora question pairs. University of Waterloo.
D'Amour et al. (2020). Underspecification presents challenges for credibility in modern machine learning. arXiv preprint arXiv:2011.03395.
Devlin et al. (2019). BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186.
Dodge et al. (2020). Fine-tuning pretrained language models: weight initializations, data orders, and early stopping. arXiv preprint arXiv:2002.06305.
Dolan and Brockett (2005). Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).
Dror et al. (2019). Deep dominance - how to properly compare deep neural models. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 2773–2785.
Efron and Tibshirani (1994). An Introduction to the Bootstrap. CRC Press.
Field and Welsh (2007). Bootstrapping clustered data. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 69(3), pp. 369–390.
Hewitt and Manning (2019). A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4129–4138.
Kingma and Ba (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Lacoste et al. (2019). Quantifying the carbon emissions of machine learning. arXiv preprint arXiv:1910.09700.
Liu et al. (2019). RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
McCoy et al. (2020). BERTs of a feather do not generalize together: large variability in generalization across models with similar test set performance. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, Online, pp. 217–227.
McCoy et al. (2019). Right for the wrong reasons: diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3428–3448.
Nadeem et al. (2020). StereoSet: measuring stereotypical bias in pretrained language models. arXiv preprint arXiv:2004.09456.
Naik et al. (2018). Stress test evaluation for natural language inference. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, New Mexico, USA, pp. 2340–2353.
Patterson et al. (2021). Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350.
Petroni et al. (2019). Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 2463–2473.
Phang et al. (2018). Sentence encoders on STILTs: supplementary training on intermediate labeled-data tasks. arXiv preprint arXiv:1811.01088.
Rajpurkar et al. (2016). SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 2383–2392.
Rogers et al. (2020). A primer in BERTology: what we know about how BERT works. Transactions of the Association for Computational Linguistics 8, pp. 842–866.
Socher et al. (2013). Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Washington, USA, pp. 1631–1642.
Strubell et al. (2019). Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3645–3650.
Tenney et al. (2019). BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4593–4601.
Turc et al. (2019). Well-read students learn better: on the importance of pre-training compact models. arXiv preprint arXiv:1908.08962.
Van der Vaart (2000). Asymptotic Statistics. Vol. 3, Cambridge University Press.
Voita et al. (2019). Analyzing multi-head self-attention: specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 5797–5808.
Wang et al. (2019). GLUE: a multi-task benchmark and analysis platform for natural language understanding. In Proceedings of ICLR.
Warstadt et al. (2018). Neural network acceptability judgments. arXiv preprint arXiv:1805.12471.
Williams et al. (2018). A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 1112–1122.
Wolf et al. (2020). Transformers: state-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, pp. 38–45.
Zhang et al. (2020). Semantics-aware BERT for language understanding. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 9628–9635.
Zhong et al. (2021). Are larger pretrained language models uniformly better? Comparing performance at the instance level. arXiv preprint arXiv:2105.06020.
Zhou et al. (2020). The curse of performance instability in analysis datasets: consequences, source, and suggestions. arXiv preprint arXiv:2004.13606.
Zhu et al. (2015). Aligning books and movies: towards story-like visual explanations by watching movies and reading books. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 19–27.