On the comparability of Pre-trained Language Models

01/03/2020 ∙ by Matthias Aßenmacher, et al. ∙ Universität München

Recent developments in unsupervised representation learning have successfully established the concept of transfer learning in NLP. Three main forces are driving the improvements in this area of research. First, more elaborate architectures make better use of contextual information: instead of simply plugging in static pre-trained representations, these are learned based on the surrounding context in end-to-end trainable models with more intelligently designed language modelling objectives. Second, larger corpora are used as resources for pre-training large language models in a self-supervised fashion, which are afterwards fine-tuned on supervised tasks. Third, advances in parallel computing as well as in cloud computing have made it possible to train these models with growing capacities in the same or even in shorter time than previously established models. These three developments culminate in new state-of-the-art (SOTA) results being reported at an ever higher frequency. It is not always obvious where these improvements originate from, as it is not possible to completely disentangle the contributions of the three driving forces. We set out to provide a clear and concise overview of several large pre-trained language models which achieved SOTA results in the last two years, with respect to their use of new architectures and resources. We want to clarify for the reader where the differences between the models lie, and we furthermore attempt to gain some insight into the individual contributions of lexical/computational improvements as well as of architectural changes. We explicitly do not intend to quantify these contributions, but rather see our work as an overview in order to identify potential starting points for benchmark comparisons. Furthermore, we tentatively want to point out potential possibilities for improvement in the field of open-sourcing and reproducible research.




1 Introduction

For most NLP tasks, researchers turn to pre-trained word embeddings (mikolov2013efficient; pennington2014glove; bojanowski2017enriching) as a key component of their models. These representations map each word of a sequence to a real-valued vector of fixed dimension. A drawback of these kinds of externally learned features is that they are (i) fixed, i.e. they cannot be adapted to the specific domain they are used in, and (ii) context independent, i.e. there is only one embedding for a word by which it is represented in any context.
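To make the context independence concrete, the following minimal sketch uses a hypothetical three-word vocabulary with made-up 4-dimensional vectors (the values are invented and not taken from any of the cited embedding models). The lookup returns the same vector for "bank" regardless of the sentence in which it occurs.

```python
# Toy static embedding table. Words and vector values are invented for
# illustration; they do not come from Word2Vec, GloVe or FastText.
EMBEDDINGS = {
    "river": [0.1, 0.7, -0.3, 0.0],
    "bank": [0.5, -0.2, 0.4, 0.9],
    "money": [-0.6, 0.3, 0.8, 0.1],
}
UNKNOWN = [0.0, 0.0, 0.0, 0.0]  # fallback for out-of-vocabulary words


def embed(sentence):
    """Map each whitespace-separated word to its fixed vector."""
    return [EMBEDDINGS.get(word, UNKNOWN) for word in sentence.lower().split()]


river_sentence = embed("river bank")
money_sentence = embed("bank money")

# "bank" is the second word of the first sentence and the first word of
# the second one; a static lookup cannot tell the two usages apart.
same_vector = river_sentence[1] == money_sentence[0]
print(same_vector)
```

A contextual model would instead compute the representation of "bank" from the whole sentence, so the two vectors would generally differ.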

More recently, transfer learning approaches, as for example using convolutional neural networks (CNNs) pre-trained on ImageNet in computer vision, have entered the discussion. Transfer learning in the NLP context means pre-training a network with a self-supervised objective on large amounts of plain text and fine-tuning its weights afterwards on a task-specific, labelled data set. (By self-supervised learning we refer to a technique where the labels are automatically generated from the data itself without relying on external labels.) For a comprehensive overview of the current state of transfer learning in NLP, we recommend the excellent tutorial and blog post by ruder-etal-2019-transfer (https://ruder.io/state-of-transfer-learning-in-nlp/).
With ULMFiT (Universal Language Model Fine Tuning), howard2018universal proposed an LSTM-based (hochreiter1997long) approach for transfer learning in NLP using AWD-LSTMs (merity2017regularizing). After pre-training on a large unlabelled corpus, a task-specific layer is added to the network and the whole network is fine-tuned using labelled data. This model can be characterised as unidirectional contextual, while a bidirectionally contextual LSTM-based model was presented in ELMo (Embeddings from Language Models) by peters2018deep.

Architectural Details
Model     Release  Architecture    Contextuality   Transfer learning integration
Word2Vec  01/2013  FCNN            None            Embedding model
FastText  07/2016  FCNN            None            Embedding model
ULMFiT    01/2018  Forward LSTM    Unidirectional  Fully end-to-end trainable
ELMo      02/2018  biLSTM          Bidirectional   Embedding model
GPT       06/2018  Transformer     Unidirectional  Fully end-to-end trainable
BERT      10/2018  Transformer     Bidirectional   Fully end-to-end trainable
GPT2      02/2019  Transformer     Unidirectional  Fully end-to-end trainable
XLNet     06/2019  Autoregressive  Bidirectional   Fully end-to-end trainable
RoBERTa   07/2019  Transformer     Bidirectional   Fully end-to-end trainable
ALBERT    09/2019  Transformer     Bidirectional   Fully end-to-end trainable
Table 1: Summary of the basic facts of the evaluated model architectures. Although they are not a central part of this evaluation, Word2Vec and FastText are added as baseline comparisons. With Transfer learning integration, we specify to which degree a model is capable of transfer learning. We distinguish between embedding models and end-to-end trainable transfer learning models.

The bidirectionality in ELMo is achieved by using biLSTMs instead of AWD-LSTMs. On the other hand, ULMFiT uses a more "pure" transfer learning approach compared to ELMo, as the ELMo-embeddings are extracted from the pre-training model and are not fine-tuned in conjunction with the weights of the task-specific architecture.
The OpenAI GPT (Generative Pre-Training, radford2018improving) is a model which resembles ULMFiT in two crucial points. It is a unidirectional language model and it allows stacking task-specific layers on top after pre-training, i.e. it is fully end-to-end trainable. The major difference between these two models is the architecture inside the LM, where OpenAI GPT uses the Transformer architecture (vaswani2017attention).
Instead of processing one input token at a time, like recurrent architectures (LSTMs, GRUs) do, the Transformer takes in the whole sequence at once. This is possible because it utilizes a variant of the Attention mechanism (bahdanau2014neural), which allows modelling dependencies without having to feed the data to the model sequentially. At the same time, the OpenAI GPT can be characterised as a unidirectional model, as it only takes the left side of the context into account. Its successor OpenAI GPT2 (radford2019language) possesses (despite some smaller architectural changes) mainly the same model architecture and can thus also be termed unidirectional contextual.
The original BERT (Bidirectional Encoder Representations from Transformers, devlin2018bert), and consequently the other two BERT-based approaches discussed here (liu2019roberta; lan2019albert) as well, differ from the GPT models in that they are bidirectional Transformer models. devlin2018bert developed Masked Language Modelling (MLM) as a special training objective which allows the use of a bidirectional Transformer without compromising the language modelling objective. XLNet (yang2019xlnet), in contrast, relies on an objective which the authors call Permutation Language Modelling (PLM) and thus also manages to model a bidirectional context despite being an auto-regressive model. A brief overview of the characteristics of the explained models can be found in table 1.
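The distinction between unidirectional and bidirectional contextuality can be expressed as a visibility pattern over positions. The sketch below is illustrative only: the function name `visibility_mask` is hypothetical, and real implementations apply such a pattern inside the attention computation rather than as a standalone boolean matrix.

```python
def visibility_mask(seq_len, bidirectional):
    """Return mask[i][j] == True if position i may attend to position j."""
    return [
        [bidirectional or j <= i for j in range(seq_len)]
        for i in range(seq_len)
    ]


causal = visibility_mask(4, bidirectional=False)  # GPT-style: left context only
full = visibility_mask(4, bidirectional=True)     # BERT-style: whole sequence

# Print the causal pattern: "x" marks a visible position, "." a hidden one.
for row in causal:
    print("".join("x" if visible else "." for visible in row))
```

In the unidirectional case the pattern is lower-triangular, so the first token sees only itself and the last token sees the entire prefix; in the bidirectional case every position sees every other position.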

2 Related work

In their stimulating paper, raffel2019exploring take several steps in a similar direction by trying to ensure comparability among different transformer-based models. They perform various experiments with respect to the transfer learning ability of a transformer encoder-decoder architecture by varying the pre-training objective (different variants of denoising vs. language modelling), the pre-training resources (their newly introduced C4 corpus vs. variants thereof) and the parameter size (from 200M up to 11B). In particular, their approach of introducing a new corpus and creating subsets resembling previously used corpora like RealNews (zellers2019defending) or OpenWebText (gokaslan2019openweb) is promising with respect to ensuring comparability.
However, their experiments do not cover an important point we are trying to address in our paper: focussing on only one specific architecture does not answer the question of which components explain the performance differences between two models whose overall architecture differs as well (e.g. Attention-based vs. LSTM-based). yang2019xlnet also address model comparability to some extent by performing an ablation study to compare their XLNet explicitly to BERT (devlin2018bert). In this ablation study, they train six different XLNet-based models in which they modify different parts of the models in order to quantify how these design choices influence performance. At the same time, they restrict themselves to an architecture of the same size as BERT-base and use the same lexical resources for pre-training. liu2019roberta vary their RoBERTa model with respect to model size and use of pre-training resources in order to perform an ablation study aiming at comparability to BERT. lan2019albert go even one step further with their ALBERT model by also comparing their model to BERT with regard to run time and width/depth of the model.

Although all these experiments are highly valuable steps in the direction of better comparability, there are still no clear guidelines on which comparisons to perform in order to ensure a maximum degree of model comparability with respect to potentially influential factors.

3 Materials and Methods

First, we will present the different available corpora which were utilised for pre-training the models and compare them with respect to their size, the domain they’re from and their accessibility. Subsequently, we will briefly introduce common benchmark data sets which the models are fine-tuned and evaluated on.
While the conceptual differences between the evaluated models have already been addressed in the introduction, the models will now be described in more detail. This is driven by the intention to emphasise differences beyond the obvious, conceptual ones.

3.1 Training corpora

We will start this chapter by briefly introducing the pre-training resources, which are commonly used. While there are some corpora that are commonly used by most of the models, some other corpora are often just used by one model in conjunction with one of the more popular ones. An overview is to be found in table 2.

English Wikipedia


devlin2018bert state that they used data from the English Wikipedia and provide a manual for crawling it, but no actual data set. Their data encompassed around 2.5B words. Wikipedia data sets are also available in TensorFlow (cf. table 2).


Corpus             Source                    Word count (a)  Accessibility         Used by
English Wikipedia  tensorflow                2,500M          Fully available       BERT; XLNet
CommonCrawl        https://commoncrawl.org/  Unclear         Fully available       XLNet
ClueWeb 2012-B     callan2009clue09          Unclear         Fully available ($$)  XLNet
Giga5              parker2011english         Unclear         Fully available ($$)  XLNet
1B Word Benchmark  chelba2013one             830M            Fully available       ELMo
BooksCorpus        zhu2015aligning           985M            Not available         OpenAI GPT; BERT
Wikitext-103       merity2016pointer         103M            Fully available       ULMFiT
CC-News            liu2019roberta            Unclear         Crawling Manual       RoBERTa
Stories            trinh2018simple           7,000M (b)      Fully available       RoBERTa
WebText            radford2019language       Unclear         Not available         OpenAI GPT2
OpenWebText        gokaslan2019openweb       Unclear         Fully available       RoBERTa
(a) We report the word count as given in the respective articles proposing the corpora. Note that the number of tokens depends on the tokenization scheme used by a specific model.
(b) Stated by one of the authors on Twitter: https://twitter.com/thtrieu_/status/1096672446864748545
Table 2: Pre-training resources used by the language models (sorted by release date). Concerning the Accessibility, the category Crawling Manual can be ranked between the two other categories. In this case, the authors did not provide the data, but at least a (more or less detailed) manual for crawling the data (or similar data) oneself. The dollar signs in brackets signify the necessity of a payment in order to get access to the corpus. There is no information on RealNews (zellers2019defending) and C4 (raffel2019exploring), as these corpora were not used by the evaluated models.


CommonCrawl

Among other resources, yang2019xlnet used data from CommonCrawl. Besides stating that they filtered out short or low-quality content, no further information is given. Since CommonCrawl is a dynamic database which is updated on a monthly basis, and the extracted amount of data always depends on the user, we cannot provide a word count for this source in table 2.

ClueWeb (callan2009clue09), Giga5 (parker2011english)

The information about the use of ClueWeb and Giga5 is as sparse as for CommonCrawl (all three were used for pre-training XLNet). ClueWeb was obtained by crawling 2.8M web pages in 2012; Giga5 was crawled between 01/2009 and 12/2010.

1B Word Benchmark (chelba2013one; https://ai.google/research/pubs/pub41880)

This corpus, actually introduced as a benchmark data set by chelba2013one back in 2013, combines multiple data sets from the EMNLP 2011 workshop on Statistical Machine Translation (WMT11, http://statmt.org/wmt11/). The authors normalised and tokenized the corpus and performed further pre-processing steps by dropping duplicate sentences as well as discarding words with a count below three. Additionally, they randomised the ordering of the sentences in the corpus. This yields a corpus with a vocabulary of 793,471 words and a total word count of 829,250,940 words.

BooksCorpus (zhu2015aligning; https://yknzhu.wixsite.com/mbweb)

With their work from 2015, Zhu et al. introduced two corpora: the MovieBook Dataset and the BooksCorpus, with the latter being heavily used for pre-training language models (cf. table 2). In their work, they used the BooksCorpus in order to train a model for retrieving sentence similarity.
Overall, the corpus comprises 984,846,357 words in 74,004,228 sentences obtained from analysing 11,038 books. (devlin2018bert report that the BooksCorpus consists of only 800M words; we assume that the difference is attributable to potentially different pre-processing.) The vocabulary consists of 1,316,420 unique words, making the corpus lexically more diverse than the 1B Word Benchmark, as its vocabulary is 66% larger while its word count is only 19% higher. Unfortunately, it is no longer available for public download; the authors just provide a link to the ebook store from which they scraped the corpus.

Wikitext-103 (merity2016pointer; http://academictorrents.com/details/a4fee5547056c845e31ab952598f43b42333183c)

merity2016pointer emphasised the necessity for a new large-scale language modelling data set by stressing the shortcomings of other corpora. They explicitly highlight the occurrence of complete articles, which allow the models to learn long-range dependencies, as one of the main benefits of their corpus. This property is, according to merity2016pointer, not given in the 1B Word Benchmark, as the sentence ordering is randomised there. With a count of 103,227,021 tokens and a vocabulary size of 267,735, it is about one eighth of the 1B Word Benchmark's size concerning token count and about one third concerning the vocabulary size. Note that there is also a smaller corpus available (Wikitext-2: http://academictorrents.com/details/ac7ffa98b66427246a316a81b2ea31c9b58ea5b6), which is a subset of about 2% of the size of Wikitext-103.
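The relative size statements for the BooksCorpus and Wikitext-103 can be checked with a few lines of arithmetic. The sketch below only uses the token counts and vocabulary sizes quoted in the text; the dictionary layout and the helper name `ratio` are illustrative choices.

```python
# Token counts and vocabulary sizes as reported in the text.
corpora = {
    "1B Word Benchmark": {"tokens": 829_250_940, "vocab": 793_471},
    "BooksCorpus": {"tokens": 984_846_357, "vocab": 1_316_420},
    "Wikitext-103": {"tokens": 103_227_021, "vocab": 267_735},
}


def ratio(corpus, reference, key):
    """Size of `corpus` relative to `reference` for the given statistic."""
    return corpora[corpus][key] / corpora[reference][key]


ref = "1B Word Benchmark"
books_vocab = ratio("BooksCorpus", ref, "vocab")     # roughly 66% larger
books_tokens = ratio("BooksCorpus", ref, "tokens")   # roughly 19% larger
wiki_tokens = ratio("Wikitext-103", ref, "tokens")   # roughly one eighth
wiki_vocab = ratio("Wikitext-103", ref, "vocab")     # roughly one third

print(f"BooksCorpus:  vocab x{books_vocab:.2f}, tokens x{books_tokens:.2f}")
print(f"Wikitext-103: vocab x{wiki_vocab:.2f}, tokens x{wiki_tokens:.3f}")
```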

CC-News (nagel2016ccnews)

The CC-News corpus was presented and used in liu2019roberta. They used a web crawler proposed by hamborg2017news to extract data from the CommonCrawl News data set (nagel2016ccnews) and obtained a data set similar to the RealNews data set (zellers2019defending).

Stories (trinh2018simple; https://console.cloud.google.com/storage/browser/commonsense-reasoning/reproduce/stories_corpus)

This data set is also a specific subset of the CommonCrawl data. The authors built the data based on questions in common sense reasoning tasks. They extracted nearly 1M documents, most of which are taken from longer, coherent stories (hence the name of the corpus). One of the authors stated on Twitter (https://twitter.com/thtrieu_/status/1096672446864748545) that the corpus contains approximately 7B words.

WebText (radford2019language)

The data set GPT2 was pre-trained on is not publicly available and was obtained by creating "a new web scrape which emphasised document quality" (radford2019language).


OpenWebText (gokaslan2019openweb)

As a reaction to radford2019language not releasing their pre-training corpus, gokaslan2019openweb started an initiative to create an open-source emulation of the WebText corpus.

It becomes obvious that there is a lot of heterogeneity with respect to the observed combinations of availability and the clear specification of the corpus size as word count. Some corpora specify their size in gigabytes, but do not provide a token count or a vocabulary size. Thus, we can state that there is some lack of transparency when it comes to the lexical resources used for pre-training. In particular, the missing availability of the BooksCorpus is problematic, as this corpus is heavily used for pre-training.
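The three size measures mentioned above (size on disk, word count, vocabulary size) are cheap to report together. The following sketch assumes plain whitespace tokenization, which is a simplification: the evaluated models use subword tokenizers, so their token counts for the same text would differ. The function name `corpus_statistics` is hypothetical.

```python
def corpus_statistics(text):
    """Report size in bytes, whitespace word count and vocabulary size."""
    words = text.split()
    return {
        "bytes": len(text.encode("utf-8")),  # size on disk for UTF-8 text
        "words": len(words),                 # running words (tokens)
        "vocabulary": len(set(words)),       # distinct word types
    }


sample = "the cat sat on the mat"
stats = corpus_statistics(sample)
print(stats)
```

Reporting all three numbers alongside the tokenization scheme would make corpora that are currently described only in gigabytes comparable to those described by word count.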

3.2 Benchmark data sets for fine-tuning

Besides describing pre-training resources, it is also important to have a look at the data sets which are commonly used for benchmarking fine-tuned language models and thus determine new SOTA results.

GLUE (wang2018glue; https://gluebenchmark.com/)

The General Language Understanding Evaluation (GLUE) benchmark is a freely available collection of nine data sets on which models can be evaluated. It also provides a fixed train-dev-test split with held-out labels for the test set, as well as a leaderboard which displays the top submissions and the current SOTA. The relevant metric for the SOTA is an aggregate measure of the nine single-task metrics.
Table 3 provides the basic information on the data sets included in GLUE. The benchmark includes two binary classification tasks with single-sentence inputs (CoLA [warstadt2018neural] and SST-2 [socher2013recursive]) and five binary classification tasks with inputs that consist of sentence pairs (MRPC [dolan2005automatically], QQP (https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs), QNLI [wang2018glue], RTE [wang2018glue] and WNLI [wang2018glue]). The remaining two tasks also take sentence pairs as input but have a multi-class classification objective with either three (MNLI [williams2017broad]) or five classes (STS-B [cer2017semeval]).

          Single-Sentence    Similarity/Paraphrase         Inference
          CoLA     SST-2     MRPC    STS-B   QQP           MNLI    QNLI    RTE          WNLI
|train|   8.5k     67k       3.7k    7k      364k          393k    105k    2.5k         634
|test|    1k       1.8k      1.7k    1.4k    391k          20k     5.4k    3k           146
domain    misc.    movies    news    misc.   social QA     misc.   wiki    news, wiki   fiction
Table 3: A brief summary of the different data sets which together form the GLUE benchmark. This table is basically a rearrangement of table 1 from wang2018glue with slightly reduced information, as it is only meant to be an overview of the different tasks and data set sizes.

SuperGLUE (wang2019superglue; https://super.gluebenchmark.com/)

As a reaction to human baselines being surpassed by the top-ranked models, wang2019superglue proposed a set of benchmark data sets similar to, but, according to the authors, more difficult than GLUE. On average, the size of the provided training data is smaller than in GLUE and, as in GLUE, the data is split into 'train', 'dev' and 'test' sets. As of the writing of this paper, there is a large difference between the use of GLUE and SuperGLUE concerning the number of models evaluated on the respective benchmark.

          Coreference   Disambig.  Inference            Question Answering
          WSC           WiC        RTE          CB      BoolQ          COPA         MultiRC   ReCoRD
|train|   554           6k         2.5k         250     9.4k           400          5.1k      101k
|dev|     104           638        278          57      3.3k           100          953       10k
|test|    146           1.4k       300          250     3.2k           500          1.8k      10k
domain    fiction       misc.      news, wiki   misc.   google, wiki   blogs, art   misc.     news
Table 4: A brief summary of the different data sets which together form the SuperGLUE benchmark. This table is basically a rearrangement of table 1 from wang2019superglue with slightly reduced information, as it is only meant to be an overview of the different tasks and data set sizes.

SuperGLUE is considered to be more difficult than GLUE as it contains more complex tasks than just single-sentence or sentence-pair classification. SuperGLUE also features coreference resolution and question answering tasks. Unfortunately, it did not make sense to include it as a part of our model comparison, as (at the time of writing) only two of the discussed models were evaluated on SuperGLUE.

SQuAD (rajpurkar2016squad; rajpurkar2018know; https://rajpurkar.github.io/SQuAD-explorer/)

In its first version, the Stanford Question Answering Dataset (SQuAD) 1.1 (rajpurkar2016squad) consists of 100,000+ questions explicitly designed to be answerable by reading segments of Wikipedia articles. The task is to correctly locate the segment in the text which contains the answer. A shortcoming of this task is the omission of situations where the question is not answerable by reading the provided article. rajpurkar2018know address this problem in SQuAD 2.0 by adding 50,000 handcrafted unanswerable questions to the SQuAD 1.1 data set. On their homepage, the authors provide a train and development set as well as an official leaderboard. The test set is completely held out. Instead, the participants are required to upload their models to CodaLab (https://codalab.org/). The SQuAD 1.1 data is, in an augmented form (termed QNLI), also part of the GLUE benchmark.
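Results on SQuAD are commonly reported as an F1 score over the tokens of the predicted and the reference answer span. The sketch below is a simplified version under stated assumptions: it lowercases and splits on whitespace only, whereas the official evaluation script additionally normalises punctuation and articles and takes the maximum over several reference answers. The function name `token_f1` is hypothetical.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-overlap F1 between a predicted and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # An empty prediction only matches an empty reference, which is how
        # an unanswerable question (SQuAD 2.0) can be scored.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts every shared token at most as often
    # as it occurs in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


partial = token_f1("in Paris", "Paris")  # one extra token in the prediction
print(round(partial, 3))
```

For the example, precision is 1/2 and recall is 1, so the prediction is rewarded for containing the answer but penalised for the superfluous token.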

RACE (lai2017race; http://www.qizhexie.com/data/RACE_leaderboard.html)

The Large-scale ReAding Comprehension Dataset From Examinations (RACE) contains (English) exam questions for Chinese students (middle and high school). In most articles in which a model is evaluated on RACE, the data set is described as especially challenging due to (i) the length of the passages, (ii) the inclusion of reasoning questions and (iii) the intentionally tricky design of the questions in order to test a human's ability in reading comprehension. The data set can be subdivided into RACE-M (middle school examination) and RACE-H (high school examination) and comprises a total of 97,687 questions on 27,933 passages of text.

3.3 Evaluated Models

ULMFiT (howard2018universal)

ULMFiT, introduced at the beginning of 2018, was the first "pure" transfer learning approach applied in NLP. The core of the model builds on the work of merity2017regularizing, as it uses AWD-LSTMs, an LSTM variant that makes use of DropConnect (wan2013regularization) for better regularisation and applies averaged stochastic gradient descent (ASGD) for optimization (polyak1992acceleration). The model consists of a 400-dimensional embedding layer followed by three LSTM layers, each of which encompasses 1150 hidden units. howard2018universal stack a softmax classifier with a hidden layer size of 50 on top of this architecture for pre-training the model. This final layer is complemented by a task-specific final layer during fine-tuning. The vocabulary size is limited to 30k words as in johnson2016convolutional; johnson2017deep.

In contrast to the other models discussed in this paper, ULMFiT was not evaluated on the GLUE benchmark but on several other data sets (IMDb [maas2011learning], TREC-6 [voorhees1999trec], Yelp-bi, Yelp-full, AG's news, DBpedia [all zhang2015character]).

          Question      Sentiment                           Topic
task      TREC-6        IMDb     Yelp-bi     Yelp-full      AG's news   DBpedia
|train|   5k            25k      560k        650k           120k        560k
|test|    0.5k          25k      38k         50k            7.6k        70k
domain    open-domain   movies   social QA   social QA      news        wiki
Table 5: An overview of the data sets on which ULMFiT was fine-tuned and evaluated. It is an extension of table 1 (howard2018universal), adding information on the size of the test set and the domain. All six tasks are classification tasks, where the target variables have between 2 and 14 classes.

ELMo (peters2018deep)

As already stated in section 1, ELMo differs from ULMFiT with respect to its usability for transfer learning. The pre-trained ELMo embeddings are plugged in at the lowest layer of an arbitrary NLP model in order to use them for a downstream task. (The authors also mention that additionally adding the ELMo embeddings at one of the final layers might improve performance for some architectures and tasks.) In the case of ELMo this means the following: as ELMo consists of multiple biLSTM layers, one can extract multiple intermediate-layer representations from the model. These representations are used for computing a (task-specific) weighted combination, which is concatenated with static context-independent word embeddings. So the model weights of ELMo are not updated during the training of the downstream model; only the weights which are learned for combining the intermediate-layer representations from ELMo are. peters2018deep evaluate an ELMo-based model on SQuAD and other tasks, but when it comes to GLUE there are multiple ELMo-based architectures available on the leaderboard (https://gluebenchmark.com/leaderboard). Thus, here we will concentrate on the best-performing ELMo-based model, a BiLSTM model with Attention (wang2018glue).
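The task-specific weighted combination of intermediate-layer representations can be sketched as follows. This is an illustration under assumptions: the mixing weights are softmax-normalised and multiplied by a single scale factor, the layer vectors are toy values for one token, and the names `combine_layers`, `raw_weights` and `scale` are hypothetical.

```python
import math


def softmax(weights):
    """Normalise raw weights so that they are positive and sum to one."""
    shifted = [w - max(weights) for w in weights]  # for numerical stability
    exps = [math.exp(w) for w in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def combine_layers(layer_vectors, raw_weights, scale=1.0):
    """Weighted sum of the per-layer vectors of a single token."""
    norm = softmax(raw_weights)
    dim = len(layer_vectors[0])
    return [
        scale * sum(w * layer[d] for w, layer in zip(norm, layer_vectors))
        for d in range(dim)
    ]


# Three layers with toy 2-dimensional representations of one token.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = combine_layers(layers, raw_weights=[0.0, 0.0, 0.0])
print(mixed)
```

Only `raw_weights` and `scale` would be trained together with the downstream model, while the layer representations themselves stay fixed, which is what distinguishes this setup from a fully end-to-end trainable model.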

OpenAI GPT (radford2018improving)

The OpenAI GPT is a purely attention-based architecture that does not make use of any recurrent layers. Pre-training is performed by combining Byte-Pair encoded (sennrich2015neural) token embeddings with learned position embeddings and feeding them into a multi-layer transformer decoder architecture with a standard language modelling objective. By using a decoder architecture, the model has access only to the preceding tokens in the sequence at each step. Thus, the GPT model is a unidirectional attention-based architecture. Fine-tuning was, amongst others, performed on the nine tasks that together form the GLUE benchmark.
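Byte-Pair Encoding builds a subword vocabulary by repeatedly merging the most frequent pair of adjacent symbols. The following minimal sketch learns a few merges from a toy word-frequency table; the words and counts are invented for illustration, the function name `learn_bpe` is hypothetical, and production tokenizers add details such as end-of-word markers that are omitted here.

```python
from collections import Counter


def learn_bpe(word_counts, num_merges):
    """Learn `num_merges` BPE merge rules from a word-frequency table."""
    # Every word starts out as a tuple of single characters.
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Replace every occurrence of the best pair by the merged symbol.
        merged_vocab = {}
        for symbols, count in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_vocab[tuple(out)] = merged_vocab.get(tuple(out), 0) + count
        vocab = merged_vocab
    return merges, vocab


toy_counts = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
merges, segmented = learn_bpe(toy_counts, num_merges=3)
print(merges)
```

Frequent words end up as single symbols while rare words remain split into smaller units, which keeps the vocabulary size bounded without introducing out-of-vocabulary tokens.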

BERT (devlin2018bert)

This model can be seen as a reference point for everything that came thereafter. Similar to GPT, it uses a subword vocabulary (WordPiece, a close relative of Byte-Pair Encoding (BPE)) with a size of 30k. By introducing the MLM training objective, the authors were able to combine deep bidirectionality with the self-attention mechanism for the first time. In addition to the MLM objective, it also utilizes a next-sentence prediction (NSP) objective, the usefulness of which has been debated in other research papers (liu2019roberta). The BERT-BASE model consists of 12 bidirectional transformer-encoder blocks (24 for BERT-LARGE) as described in vaswani2017attention, with 12 (16 respectively) attention heads per block and an embedding size of 768 (1024 respectively). The need to better understand the behaviour of these huge networks even gave rise to a new field of research called BERTology, which aims at explaining the inner workings of BERT-based models.
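The reported model sizes can be roughly reproduced from the architectural figures given above. The sketch below is a back-of-the-envelope estimate under assumptions that are not stated in the text: a feed-forward width of four times the embedding size, 512 learned position embeddings, two segment embeddings and layer normalisation with a gain and a bias. Output heads are ignored, and the function name `encoder_parameters` is hypothetical.

```python
def encoder_parameters(hidden, layers, vocab=30_000, max_positions=512,
                       ffn_multiplier=4, segments=2):
    """Approximate parameter count of a BERT-style transformer encoder."""
    ffn = ffn_multiplier * hidden      # assumed feed-forward width
    layer_norm = 2 * hidden            # gain and bias
    embeddings = (vocab + max_positions + segments) * hidden + layer_norm
    # Query, key, value and output projections, each with a bias vector.
    attention = 4 * (hidden * hidden + hidden)
    # Two dense layers of the position-wise feed-forward network.
    feed_forward = hidden * ffn + ffn + ffn * hidden + hidden
    per_layer = attention + feed_forward + 2 * layer_norm
    return embeddings + layers * per_layer


base = encoder_parameters(hidden=768, layers=12)
large = encoder_parameters(hidden=1024, layers=24)
print(f"BERT-BASE  ~{base / 1e6:.0f}M parameters")
print(f"BERT-LARGE ~{large / 1e6:.0f}M parameters")
```

The number of attention heads does not appear in the formula, because the heads only partition the projection matrices without adding parameters. The estimates land close to the sizes commonly quoted for BERT-BASE and BERT-LARGE.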

OpenAI GPT2 (radford2019language)

With GPT2, the OpenAI team published a scaled-up version of GPT in 2019. Compared to its predecessor, it contains some smaller changes concerning the placement of layer normalisation and residual connections. Overall, there are four different versions of GPT2 with the smallest one being equal to GPT, the medium one being of similar size as BERT-LARGE and the xlarge one being released as the actual GPT2 model with 1.5B parameters.

XLNet (yang2019xlnet)

In order to overcome (what they call) the pretraining-finetune discrepancy, which is a consequence of BERT’s masking approach, and to simultaneously include bidirectional contexts, yang2019xlnet propose the PLM objective for their XLNet. They use two-stream self-attention for preserving the position information of the token to be predicted, which would otherwise be lost due to the permutation of the sequence. While the first of the two streams (content stream attention) resembles the standard self-attention from a transformer-decoder, the other stream (query stream attention) doesn’t allow the token to see itself but just the preceding tokens of the permuted sequence.
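The difference between the two attention streams can be illustrated with boolean visibility matrices derived from a sampled factorization order. This is a sketch of the masking logic only, not of the full two-stream attention computation; the function name `permutation_masks` and the use of a seeded `random.Random` are illustrative assumptions.

```python
import random


def permutation_masks(seq_len, rng):
    """Sample a factorization order and derive both visibility masks."""
    order = list(range(seq_len))
    rng.shuffle(order)  # order[k] is the position predicted at step k
    rank = {pos: step for step, pos in enumerate(order)}
    # Content stream: a token sees itself and its predecessors in the order.
    content = [[rank[j] <= rank[i] for j in range(seq_len)]
               for i in range(seq_len)]
    # Query stream: a token sees only its predecessors, never itself.
    query = [[rank[j] < rank[i] for j in range(seq_len)]
             for i in range(seq_len)]
    return order, content, query


order, content, query = permutation_masks(4, random.Random(0))
first, last = order[0], order[-1]
print("factorization order:", order)
print("query mask of first predicted position:", query[first])
print("content mask of last predicted position:", content[last])
```

Because the order is permuted, the predecessors of a token can lie on both sides of it in the original sequence, which is how bidirectional context enters an auto-regressive objective.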

RoBERTa (liu2019roberta)

With RoBERTa (short for Robustly optimized BERT approach), liu2019roberta introduce an exact (architectural) replica of BERT with tuned hyperparameters and a larger corpus used for pre-training. The masking strategy for pre-training is changed from static (masking once during pre-processing) to dynamic (masking every sequence just before feeding it to the model), the additional NSP objective is removed, the BPE-level vocabulary is adjusted and increased to 50k, and RoBERTa is trained on larger batches than BERT. All of these adjustments improve the performance of the model and make it competitive with the previous SOTA results of XLNet.
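The difference between static and dynamic masking can be sketched in a few lines. The example is deliberately simplified: every token is replaced by a mask symbol with a fixed probability (15% is assumed here), whereas the actual procedures contain further details that are left out. The function names and the `[MASK]` string are illustrative.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, rng, mask_prob=0.15):
    """Replace each token by MASK with probability `mask_prob`."""
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]


def static_epochs(tokens, epochs, seed=0):
    """Static masking: mask once during pre-processing, reuse every epoch."""
    masked = mask_tokens(tokens, random.Random(seed))
    return [masked for _ in range(epochs)]


def dynamic_epochs(tokens, epochs, seed=0):
    """Dynamic masking: draw a fresh mask each time the sequence is fed."""
    rng = random.Random(seed)
    return [mask_tokens(tokens, rng) for _ in range(epochs)]


tokens = [f"tok{i}" for i in range(200)]
static = static_epochs(tokens, epochs=3)
dynamic = dynamic_epochs(tokens, epochs=3)

print("distinct static patterns: ", len({tuple(seq) for seq in static}))
print("distinct dynamic patterns:", len({tuple(seq) for seq in dynamic}))
```

With static masking the model sees the same corrupted sequence in every epoch, while dynamic masking exposes it to a different set of prediction targets each time.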

ALBERT (lan2019albert)

By addressing the steady increase in model size as a potential problem, ALBERT (short for A Lite BERT) takes a different direction than most post-BERT architectures. lan2019albert apply parameter-reduction techniques in order to train faster models with lower memory demands that, at the same time, yield a performance comparable to SOTA models. In our work we will always refer to ALBERT-XXLARGE, which is the best-performing ALBERT model. Note that even the much smaller ALBERT models yielded results comparable to or even better than BERT.

4 Model comparison

The two tables below will try to give a comprehensive overview on the differences of the previously discussed model architectures. While table 6 will only attempt to give an overview on the amount of computation that was needed to train a given architecture on a given corpus, we will directly try to relate model architecture and size as well as usage of lexical resources to model performance in table 7.

Compute Resources
Model         Computational Resources          Training time   pfs-days (a)      size     lexical
GPT           8 GPUs (P600)                    30 days         0.96              117M     < 13GB
BERT-BASE     4 Cloud TPUs (16 chips)          4 days          0.96 [2.24] (b)   110M     13GB
BERT-LARGE    16 Cloud TPUs (64 chips)         4 days          3.84 [8.96] (b)   340M     13GB
GPT2-XLARGE   8 v3 Cloud TPUs (32 chips)       7 days          7.84              1,500M   40GB
XLNet-LARGE   128 v3 Cloud TPUs (512 chips)    2.5 days        44.8              340M     126GB
RoBERTa       DGX-1 GPUs (8x32GB V100)         NA              NA                360M     160GB
ALBERT        64 – 1024 v3 Cloud TPUs          NA              NA                233M     16GB
(a) Estimation according to the formula proposed on https://openai.com/blog/ai-and-compute/: $\text{pfs-days} = \#\text{units} \times \text{PFLOPS/unit} \times \text{days} \times \text{utilization}$, with an assumed utilization of one third. Information on PFLOPS/unit for TPUs from https://cloud.google.com/tpu/.
(b) We provide two numbers here, as devlin2018bert do not specify whether they use v2 or v3 TPUs. The first number assumes the use of v2 TPUs, the one in square brackets assumes the use of v3 TPUs.
Table 6: Summary of the basic facts of the evaluated transfer learning model architectures. Word2Vec, FastText and ELMo are not included, as these are not end-to-end trainable models, meaning that the model size also depends on the model used after obtaining the embeddings. The parameter size of ULMFiT is assumed to be the larger value from merity2017regularizing, since howard2018universal use plain AWD-LSTMs with a vocabulary size of 30k tokens like johnson2016convolutional; johnson2017deep. Values for GPT2-XLARGE are taken from strubell2019energy.
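The pfs-days estimates for the TPU-trained models in table 6 can be reproduced by multiplying the number of chips, the peak PFLOPS per chip, the training time in days and the assumed utilization of one third. The per-chip values below (0.045 PFLOPS for v2 and 0.105 PFLOPS for v3, i.e. 180 and 420 TFLOPS per four-chip device) are assumptions chosen because they are consistent with the numbers in the table; the function name `pfs_days` is hypothetical.

```python
def pfs_days(units, pflops_per_unit, days, utilization=1 / 3):
    """Estimate compute in petaflop/s-days."""
    return units * pflops_per_unit * days * utilization


# Assumed peak performance per TPU chip in PFLOPS.
TPU_V2_CHIP = 0.045
TPU_V3_CHIP = 0.105

estimates = {
    "BERT-BASE (v2)": pfs_days(16, TPU_V2_CHIP, 4),
    "BERT-BASE (v3)": pfs_days(16, TPU_V3_CHIP, 4),
    "BERT-LARGE (v2)": pfs_days(64, TPU_V2_CHIP, 4),
    "BERT-LARGE (v3)": pfs_days(64, TPU_V3_CHIP, 4),
    "GPT2-XLARGE": pfs_days(32, TPU_V3_CHIP, 7),
    "XLNet-LARGE": pfs_days(512, TPU_V3_CHIP, 2.5),
}

for model, value in estimates.items():
    print(f"{model:<16} {value:6.2f} pfs-days")
```

The estimate scales linearly in every factor, so an unreported hardware generation (as for BERT) or an unreported training time (as for RoBERTa and ALBERT) directly translates into an unknown or ambiguous compute figure.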

One thing that we can learn from table 6 is the unfortunate lack of details when it comes to reporting the computational resources used for training the models. While howard2018universal do not provide any information at all on the computational resources utilised for pre-training ULMFiT, the other articles are also not over-informative when it comes to reporting them. Unfortunately, there are no clear guidelines on how to appraise resource consumption when it comes to evaluating and comparing models. This may be partly attributed to the rapidly growing hardware possibilities due to modern cloud computing architectures, but in our opinion it should nevertheless be accounted for, since it may pose environmental issues (strubell2019energy) and also limits portability to smaller devices.

The second thing is that it is also important to consider the differences displayed in tables 6 and 7 when comparing model performances. When comparing two models of approximately the same size (e.g. BERT-BASE versus GPT), it seems obvious that the superior performance of BERT-BASE originates purely from its more elaborate model architecture (cf. table 1). But one should also be aware of the larger pre-training resources (BERT-BASE uses at least twice as much data for pre-training) as well as the unknown differences in the use of computing power. We estimated the amount of compute used by each model in pfs-days; by this estimate, BERT-BASE used at least as much compute as GPT.
Another aspect which should not be ignored when evaluating performance is the use of ensemble models. As can be seen in the first column of table 7, the three ensemble models seem to outperform both of the BERT models by a large margin. Only parts of these differences may be attributed to the model architecture, as the ensembling as well as the larger pre-training resources might also give an advantage to these models. As there are unfortunately no single model performance values available for XLNet, RoBERTa and ALBERT on the official GLUE leaderboard, we also compare the single model performances from lan2019albert obtained on the dev sets (WNLI excluded). From this comparison we can get a good impression of how high the contribution of model ensembling might be: The difference between BERT-LARGE and the XLNet ensemble in the official scores (7.9 percentage points) is more than twice as high as the difference on the dev score (3.4 percentage points).
In order to address the differences in size of the pre-training resources, yang2019xlnet make the extremely insightful effort to compare a BASE variant of XLNet to BERT-BASE (same size and same pre-training resources). While the F1 score on the SQuAD v2.0 dev set is still remarkably higher than for BERT-BASE (almost comparable to BERT-LARGE) it does not show a large improvement on the RACE test set anymore (which might have been expected due to the large improvement of XLNet-LARGE over both BERT models).

| Model | GLUE leaderboard | GLUE dev | SQuAD v1.1 (dev) | SQuAD v2.0 (dev) | RACE test | size | lexical |
|---|---|---|---|---|---|---|---|
| BERT-BASE | 78.3 | | 88.5 | 76.3 | 65.0 | 110M | 13GB |
| ELMo-based | -8.3 | | -2.9 | | | | |
| GPT | -5.5 | | | | -6.0 | 1.1x | < 0.5x |
| BERT-LARGE | +2.2 | 84.05 | +2.4 | +5.6 | +7.0 | 3.1x | 1.0x |
| XLNet-BASE | | | | +5.03 | +1.05 | 1.0x | 1.0x |
| XLNet-LARGE | +10.1 | +3.39 | +6.0 | +12.5 | +16.75 | 3.1x | 9.7x |
| RoBERTa | +10.2 | +5.19 | +6.1 | +13.1 | +18.2 | 3.3x | 12.3x |
| RoBERTa-BASE | | +2.30 | | | | 1.0x | 12.3x |
| RoBERTa | | +3.79 | +5.1 | +11.0 | | 3.3x | 1.2x |
| ALBERT | +11.1 | +5.91 | +5.6 | +13.9 | +21.5 | 2.1x | 1.2x |
Ensemble performance; No single model performance available
Own calculations based on lan2019albert table 13; WNLI is omitted
Result for BERT-BASE on SQuAD v2.0 is taken from yang2019xlnet table 6
Result for BERT-BASE on RACE is taken from zhang2019dual table 2
liu2019roberta and lan2019albert specify the BooksCorpus + English Wikipedia as 16GB
This variant of RoBERTa uses only BooksCorpus + English Wikipedia for pre-training
Table 7: Performance of different models on GLUE, SQuAD and RACE as well as model size and resource usage compared to BERT-BASE (except for GLUE dev set performance, where BERT-LARGE is the reference). Performance differences on the benchmark data sets are given in percentage points, while the differences in size/resources are given as factors, e.g. BERT-LARGE has 3.1 times the size of BERT-BASE and performs 2.2 percentage points better on GLUE. We omit SuperGLUE in this table as, at the time of writing, only BERT and RoBERTa had been evaluated on it. ULMFiT and OpenAI GPT2 are also omitted as there are no publicly available performance values on these data sets. Highest improvements over the reference model in bold. For ELMo we do not provide a model size, since the performance values are from two different models (cf. section 3.3).
Displayed performance measures are Matthews Correlation (GLUE), F1 score (SQuAD) and Accuracy (RACE).
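
The size and resource factors in table 7 follow directly from the absolute values in table 6. The sketch below shows the arithmetic under stated assumptions: the helper name `relative_factor` and the dictionary layout are hypothetical, and rounding to one decimal is inferred from the table rather than stated in the text.

```python
# Reference model: BERT-BASE (110M parameters, 13GB pre-training corpus).
REFERENCE = {"params_m": 110, "corpus_gb": 13}

# Absolute values as listed in table 6.
MODELS = {
    "BERT-LARGE": {"params_m": 340, "corpus_gb": 13},
    "XLNet-LARGE": {"params_m": 340, "corpus_gb": 126},
    "RoBERTa": {"params_m": 360, "corpus_gb": 160},
    "ALBERT": {"params_m": 233, "corpus_gb": 16},
}


def relative_factor(value, reference):
    """Express a value as a multiple of the reference, rounded to one decimal."""
    return round(value / reference, 1)


for name, facts in MODELS.items():
    size = relative_factor(facts["params_m"], REFERENCE["params_m"])
    lexical = relative_factor(facts["corpus_gb"], REFERENCE["corpus_gb"])
    print(f"{name}: size {size}x, lexical {lexical}x")

# Gap between the XLNet ensemble and BERT-LARGE on the GLUE leaderboard,
# computed from their differences to BERT-BASE (in percentage points).
leaderboard_gap = round(10.1 - 2.2, 1)
print(f"XLNet-LARGE vs. BERT-LARGE (leaderboard): {leaderboard_gap} points")
```

Keeping the absolute values and the reference model separate makes it straightforward to switch the reference, as is done for the GLUE dev column, where BERT-LARGE takes that role.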

The comparability of RoBERTa from the GLUE leaderboard (model ensemble and larger pre-training resources) to BERT-LARGE is again limited, but the authors performed several experiments in order to show the usefulness of their model optimisations. When pre-training BERT-LARGE and a single RoBERTa model on comparable lexical resources (BooksCorpus + English Wikipedia; 13GB for BERT vs. 16GB for RoBERTa), the RoBERTa model still shows a significant improvement over BERT-LARGE, even though the margin is somewhat smaller than the difference between BERT-LARGE and the ensemble model. In another ablation study, liu2019roberta train a BASE variant of RoBERTa on their larger pre-training resources. Even though it has only about one third of the size of BERT-LARGE, the larger pre-training corpus in conjunction with the optimised training leads to a slightly better performance on the GLUE dev set (without WNLI). Unfortunately we cannot compare RoBERTa-BASE to BERT-BASE, as we have neither the "official" leaderboard score for RoBERTa-BASE nor the "unofficial" dev set score for BERT-BASE.

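| Model | IMDb (sentiment) | Yelp-bi (sentiment) | Yelp-full (sentiment) | AG's news (topic) | DBpedia (topic) | size | lexical |
|---|---|---|---|---|---|---|---|
| ULMFiT | 95.40 | 97.84 | 70.02 | 94.99 | 99.20 | 33M | 0.18GB |
| BERT-LARGE | +0.09 | +0.27 | +0.66 | | +0.16 | 10.3x | 72.2x |
| XLNet-LARGE | +0.81 | +0.61 | +2.28 | +0.52 | +0.18 | 10.3x | 222.2x |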
Table 8: Performance of BERT-LARGE and XLNet-LARGE on the benchmark data sets used by howard2018universal as well as model size and resource usage compared to ULMFiT. Differences are displayed as in table 7, highest improvements over the reference model in bold. Note that we report accuracies here, as opposed to howard2018universal and yang2019xlnet (who report error rates), in order to provide a more similar interpretation of these values compared to the values in table 7. Displayed performance measures are Accuracy for all tasks.

In order to also put the results of ULMFiT into context, we present the results published by yang2019xlnet alongside the information on model size and use of lexical resources in table 8. Despite being much larger and utilising pre-training corpora that are orders of magnitude larger, both BERT-LARGE and XLNet-LARGE exhibit only modest improvements over the performance of ULMFiT. This might partly originate from the simplicity of the tasks (compared to GLUE & co.), but partly also from the already high performance levels, which leave little room for further large improvements.

5 Discussion

This chapter reflects on the main takeaways from the above comparisons and tries to raise some issues for future research practices. We do not claim to have a solution to these potentially problematic aspects, but we think that these points merit debate.

Why no benchmark corpus for pre-training?

It is good and well-established practice to use benchmark data sets like GLUE, SuperGLUE (not yet used that often), SQuAD and RACE for comparing the performance of pre-trained language models on different types of NLP/NLU tasks. Many recently published articles (liu2019roberta; yang2019xlnet; lan2019albert) perform (partly extensive) ablation studies controlling for pre-training resources in order to make (versions of) their models comparable to BERT as "benchmark model", which is really important as it helps the reader to get an intuition for the impact of pre-training resources. Nevertheless, this practice is unfortunately not perfect due to two critical issues: (i) BERT (and consequently all the other models as well) makes use of the BooksCorpus (zhu2015aligning), which is not publicly available, and (ii) this only leads to model comparisons in a low pre-training resource environment (compared to more recent models) and yields no insight into the behaviour of the reference model (e.g. BERT) in a high(er) pre-training resource context. So we view statements of the type "Model architecture A is superior to model architecture B on performing task X." somewhat critically and would propose phrasing them in a way comparable to the following statement: "Model architecture A is superior to model architecture B on performing task X, when pre-trained on a small/large corpus of low/high quality data from domain Y for time Z."

Why no standardised description of (computational) resources?

When writing this article, it sometimes turned out to be difficult to obtain a single measure of how much compute was used to pre-train the model described in an article. In our opinion, this is not due to carelessness on the part of the authors but rather to the lack of a clear reporting standard. We found ourselves confronted with the following situations:

  1. No information at all (radford2019language)

  2. Information on the used hardware (liu2019roberta; lan2019albert)

  3. Information on the used hardware and training time (devlin2018bert; yang2019xlnet)

  4. Calculation of a standardised measure (radford2018improving) (footnote 21: The calculation was not published as part of the article but is to be found in a corresponding blog post.)

While situation 1 is clearly unsatisfactory and should be avoided, scenarios 2 and 3 provide (almost) all of the necessary information but stop short of the final step to scenario 4, where the reporting would reach universal comparability across different articles. A simple and intuitive way of estimating the GPU time needed for model training was also proposed on the OpenAI blog (footnote 22: https://openai.com/blog/ai-and-compute/). This is of course not as exact as a computation based on the counts of operations in a model, but on the other hand it requires no deep insight into the model architecture and is thus applicable to a wide range of architectures without much effort.
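
To illustrate how the four reporting situations listed above relate to each other, the following sketch models what an article reports as a small record and derives the situation from it. All names (`ComputeReport`, its fields and methods) are hypothetical and only serve the illustration; the example values are taken from table 6, and the per-unit PFLOPS value is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ComputeReport:
    """Hypothetical record of the compute information given in an article."""
    hardware: Optional[str] = None               # free-text hardware description
    units: Optional[int] = None                  # number of hardware units
    pflops_per_unit: Optional[float] = None      # peak PFLOPS per unit
    training_days: Optional[float] = None        # wall-clock training time
    standardised_measure: Optional[float] = None # e.g. pfs-days, if reported

    def situation(self) -> int:
        """Return the reporting situation (1 to 4) this report falls into."""
        if self.standardised_measure is not None:
            return 4
        if self.hardware is not None and self.training_days is not None:
            return 3
        if self.hardware is not None:
            return 2
        return 1

    def missing_for_estimate(self) -> List[str]:
        """List the fields still needed to derive a standardised measure."""
        needed = ("units", "pflops_per_unit", "training_days")
        return [name for name in needed if getattr(self, name) is None]


# Hardware and training time reported (values as in table 6; PFLOPS assumed).
xlnet = ComputeReport(hardware="128 v3 Cloud TPUs", units=128,
                      pflops_per_unit=0.42, training_days=2.5)
# Only hardware reported.
roberta = ComputeReport(hardware="DGX-1 GPUs (8x32GB V100)")

for name, report in [("XLNet-LARGE", xlnet), ("RoBERTa", roberta)]:
    print(name, "situation", report.situation(),
          "missing:", report.missing_for_estimate())
```

A report in situation 3 with no missing fields can be converted into a standardised measure by the reader, whereas a report in situation 2 cannot; this is the gap a reporting standard would close.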

Shouldn’t performance be evaluated in relation to size and resource consumption?

As larger models have a higher capacity for learning good representations and using larger pre-training resources should also improve their quality, varying these two components simultaneously with the model architecture might lead to interference between the individual influences on model performance. So the intent of this aspect has a slight overlap with the question posed above, but while the above is more or less about introducing some kind of reference, this is about carefully varying and evaluating the effects of different parts of the model.

6 Conclusion

As can be seen from the above analysis, there is a clear lack of a concise guideline for fair comparisons of large pre-trained language models. It is not sufficient to just rank models by their performance on the common benchmark data sets as this does not take into account all the other factors mentioned in this analysis.

| Reporting standard | Feasibility | Current realisation | Relevance for reproducibility | OK? |
|---|---|---|---|---|
| Model architecture | Easy | Every article | Crucial | |
| Number of parameters | Easy | Most articles | Crucial | |
| – Tuning method | Easy | Some articles | High | |
| – Tuning time | Easy | No article | Medium | |
| Experimental time | Difficult | No article | Low | |
| Computational resources | Easy | Most articles | High | |
| Training time | Easy | Most articles | Medium | |
| Lexical resources | | | | |
| – Information | Easy | Every article | Crucial | |
| – Availability | Difficult | Some articles | Crucial | |
| Benchmark performance | | | | |
| – Un-Tuned single model | Difficult | No article | Low | |
| – Tuned single model | Easy | Every article | Crucial | |
Table 9: Proposal of starting points when thinking about reporting standards for pre-trained LMs. We categorise the reporting of the experimental time and the benchmark performance of the un-tuned model as not easily feasible, as one has to be aware of these standards in order to track the time of all experiments. Also, defining what is an "un-tuned" version is not always that simple. With "un-tuned" we mean not further tuned during pre-training.

A further aspect (which is not explicitly addressed here) is the reporting of resources (time and compute) spent on model development, including all experimental runs and trials, and hyperparameter tuning during pre-training. In our opinion, this is important with respect to two facets: On the one hand, it is important to take into account energy and environmental considerations when training deep learning models (strubell2019energy); on the other hand, it is also a signal to the reader/user of how difficult it is to train (and to fine-tune) the model. This might have implications for the usage of a model as a transfer learning model for diverse downstream tasks. Models that have already been tuned to a high degree during pre-training to reach a certain level of performance may, in the long run, have less potential for further improvements than models which reach that level without much hyperparameter tuning.
Taking all these considerations into account, we want to tentatively propose starting points (cf. table 9) for defining reporting standards which are globally accepted and applied when it comes to comparing pre-trained language models. We carefully try to categorise the different facets according to feasibility (How much effort does it take to report this?), current realisation (How many research papers are reporting this?) and their relevance for reproducible research (How crucial is this for performing reproducible research?). All these categorisations are of more or less subjective nature due to the fact that they cannot be quantified and are based on just a handful of the most influential research papers.
We are aware that it might take a large collective effort to establish some set of standards, but we think that it is an absolutely crucial step to describe all the aspects we mentioned in a way that is as transparent as possible in order to foster replicability and reproducibility.


List of abbreviations

Averaged stochastic gradient descent weight-dropped
bi-directional Long short-term memory
Byte-Pair Encoding
Convolutional neural network
Fully connected neural network
Gated recurrent unit
Long short-term memory
Masked Language Modelling
Natural Language Processing
Natural Language Understanding
Permutation Language Modelling