What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?

04/12/2022
by Thomas Wang, et al.

Large pretrained Transformer language models have been shown to exhibit zero-shot generalization, i.e. they can perform a wide variety of tasks that they were not explicitly trained on. However, the architectures and pretraining objectives used across state-of-the-art models differ significantly, and there has been limited systematic comparison of these factors. In this work, we present a large-scale evaluation of modeling choices and their impact on zero-shot generalization. In particular, we focus on text-to-text models and experiment with three model architectures (causal/non-causal decoder-only and encoder-decoder), trained with two different pretraining objectives (autoregressive and masked language modeling), and evaluated with and without multitask prompted finetuning. We train models with over 5 billion parameters for more than 170 billion tokens, thereby increasing the likelihood that our conclusions will transfer to even larger scales. Our experiments show that causal decoder-only models trained on an autoregressive language modeling objective exhibit the strongest zero-shot generalization after purely unsupervised pretraining. However, models with non-causal visibility on their input trained with a masked language modeling objective followed by multitask finetuning perform the best among our experiments. We therefore consider the adaptation of pretrained models across architectures and objectives. We find that pretrained non-causal decoder models can be adapted into performant generative causal decoder models, using autoregressive language modeling as a downstream task. Furthermore, we find that pretrained causal decoder models can be efficiently adapted into non-causal decoder models, ultimately achieving competitive performance after multitask finetuning. Code and checkpoints are available at https://github.com/bigscience-workshop/architecture-objective.
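To make the architectural axis of the comparison concrete, the sketch below contrasts the attention mask of a causal decoder (every token attends only to itself and earlier tokens) with that of a non-causal, prefix-LM style decoder (tokens in the input prefix attend to each other bidirectionally). This is a minimal illustration, not code from the linked repository; the function names and the sequence/prefix lengths are made up for the example.

import torch

def causal_mask(seq_len):
    # Each position attends only to itself and to earlier positions.
    return torch.ones(seq_len, seq_len).tril().bool()

def non_causal_mask(seq_len, prefix_len):
    # Positions inside the input prefix get full bidirectional visibility;
    # the remaining (target) positions stay causal.
    mask = causal_mask(seq_len)
    mask[:prefix_len, :prefix_len] = True
    return mask

# Illustrative sizes only: a length-6 sequence whose first 3 tokens are the input prefix.
print(causal_mask(6).int())
print(non_causal_mask(6, prefix_len=3).int())

An encoder-decoder achieves the same non-causal visibility over the input with a separate bidirectional encoder attached to the decoder via cross-attention; both it and the non-causal decoder count as models with "non-causal visibility on their input" in the abstract's sense.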
