Benchmarking down-scaled (not so large) pre-trained language models

05/11/2021
by M. Aßenmacher, et al.

Large Transformer-based language models are pre-trained on corpora of varying sizes, for varying numbers of steps and with different batch sizes. At the same time, more fundamental components, such as the pre-training objective or architectural hyperparameters, are modified. Overall, it is therefore difficult to ascribe changes in performance to specific factors. Since searching the hyperparameter space over the full systems is too costly, we pre-train down-scaled versions of several popular Transformer-based architectures on a common pre-training corpus and benchmark them on a subset of the GLUE tasks (Wang et al., 2018). Specifically, we systematically compare three pre-training objectives for different shape parameters and model sizes, while also varying the number of pre-training steps and the batch size. In our experiments, MLM + NSP (BERT-style) consistently outperforms MLM (RoBERTa-style) as well as the standard LM objective. Furthermore, we find that additional compute should be mainly allocated to an increased model size, while training for more steps is inefficient. Based on these observations, as a final step we attempt to scale up several systems using compound scaling (Tan and Le, 2019) adapted to Transformer-based language models.
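To make the final compound-scaling step concrete, the sketch below maps EfficientNet-style compound scaling (Tan and Le, 2019) onto two Transformer shape parameters, depth (number of layers) and width (hidden size). The coefficients alpha and beta, the rounding rule, the base configuration, and the function itself are illustrative assumptions, not the exact recipe used in the paper.

```python
# Minimal sketch of compound scaling (Tan and Le, 2019) mapped to Transformer
# shape parameters. The coefficients alpha and beta, the rounding to multiples
# of 64, and the base configuration are illustrative assumptions, not the
# exact recipe used in the paper.

def compound_scale(base_layers, base_hidden, phi, alpha=1.2, beta=1.29):
    """Jointly scale depth and width with a single compound coefficient phi.

    Depth grows as alpha**phi and width as beta**phi. Since Transformer FLOPs
    grow roughly linearly in depth and quadratically in width, choosing
    alpha * beta**2 ~= 2 approximately doubles compute per unit of phi.
    """
    layers = int(round(base_layers * alpha ** phi))
    # Round the hidden size to a multiple of 64 so it splits evenly into heads.
    hidden = max(64, int(round(base_hidden * beta ** phi / 64)) * 64)
    return layers, hidden

# Example: scale a down-scaled base model (6 layers, hidden size 512) by phi=2.
print(compound_scale(base_layers=6, base_hidden=512, phi=2))  # -> (9, 832)
```

The design choice mirrors the original constraint: fixing alpha * beta**2 near 2 ties the two shape parameters together so that one coefficient, phi, controls the overall compute budget.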


