Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers

09/22/2021
by Yi Tay, et al.

There remain many open questions pertaining to the scaling behaviour of Transformer architectures. These scaling decisions and findings can be critical, as training runs often come with an associated computational cost that has both financial and environmental impact. The goal of this paper is to present scaling insights from pretraining and finetuning Transformers. While Kaplan et al. present a comprehensive study of the scaling behaviour of Transformer language models, their scope is limited to the upstream (pretraining) loss. It therefore remains unclear whether this set of findings transfers to downstream tasks within the context of the pretrain-finetune paradigm. The key findings of this paper are as follows: (1) aside from model size alone, model shape matters for downstream fine-tuning; (2) scaling protocols operate differently at different compute regions; (3) the widely adopted T5-base and T5-large sizes are Pareto-inefficient. To this end, we present improved scaling protocols whereby our redesigned models achieve similar downstream fine-tuning quality while having 50% fewer parameters and training 40% faster than the widely adopted T5-base model. We publicly release over 100 pretrained checkpoints of different T5 configurations to facilitate future research and analysis.
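The released checkpoints can be inspected with standard T5 tooling. Below is a minimal sketch, assuming the checkpoints are published on the Hugging Face Hub under identifiers such as "google/t5-efficient-small" and "google/t5-efficient-base" (the hosting location and exact names are an assumption, not stated in the abstract). It compares the parameter counts of two configurations, the kind of comparison that underlies the Pareto-efficiency claim above.

```python
# Minimal sketch: compare parameter counts of two released T5 configurations.
# The checkpoint identifiers below are assumptions about where the 100+
# checkpoints are hosted; substitute the actual names from the release.
from transformers import T5ForConditionalGeneration


def param_count(checkpoint: str) -> int:
    """Load a T5 checkpoint and return its total parameter count."""
    model = T5ForConditionalGeneration.from_pretrained(checkpoint)
    return sum(p.numel() for p in model.parameters())


for ckpt in ("google/t5-efficient-small", "google/t5-efficient-base"):
    print(f"{ckpt}: {param_count(ckpt):,} parameters")
```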

Related research

08/16/2020 · Adding Recurrence to Pretrained Transformers for Improved Efficiency and Context Size
Fine-tuning a pretrained transformer for a downstream task has become a ...

07/21/2022 · Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling?
There have been a lot of interest in the scaling properties of Transform...

07/19/2023 · Gradient Sparsification For Masked Fine-Tuning of Transformers
Fine-tuning pretrained self-supervised language models is widely adopted...

04/13/2021 · Understanding Transformers for Bot Detection in Twitter
In this paper we shed light on the impact of fine-tuning over social med...

07/20/2023 · PASTA: Pretrained Action-State Transformer Agents
Self-supervised learning has brought about a revolutionary paradigm shif...

02/24/2022 · Learning to Merge Tokens in Vision Transformers
Transformers are widely applied to solve natural language understanding ...

08/05/2021 · Finetuning Pretrained Transformers into Variational Autoencoders
Text variational autoencoders (VAEs) are notorious for posterior collaps...