How to Train BERT with an Academic Budget

04/15/2021
by Peter Izsak, et al.

While large language models à la BERT are used ubiquitously in NLP, pretraining them is considered a luxury that only a few well-funded industry labs can afford. How can one train such models with a more modest budget? We present a recipe for pretraining a masked language model in 24 hours, using only 8 low-range 12GB GPUs. We demonstrate that through a combination of software optimizations, design choices, and hyperparameter tuning, it is possible to produce models that are competitive with BERT-base on GLUE tasks at a fraction of the original pretraining cost.
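As a concrete illustration of the kind of software optimizations the abstract alludes to, the sketch below combines mixed-precision training with gradient accumulation in PyTorch, one standard way to fit a large effective batch on 12GB GPUs. The placeholder model, batch sizes, masking rate, and random data are assumptions made purely for illustration; this is not the authors' training code.

# Hedged sketch: mixed precision + gradient accumulation for masked-LM pretraining
# on memory-limited GPUs. Model, sizes, and data below are placeholders.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"

# Placeholder masked-LM "model": any module producing per-token vocabulary logits would do.
vocab_size, hidden = 30522, 256
model = nn.Sequential(nn.Embedding(vocab_size, hidden),
                      nn.Linear(hidden, vocab_size)).to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)   # -100 marks positions that are not masked

micro_batch, accum_steps, seq_len = 8, 16, 128     # effective batch = 8 * 16 sequences per GPU

def random_mlm_batch():
    """Stand-in for a real masked-LM data loader: mask ~15% of tokens for prediction."""
    input_ids = torch.randint(0, vocab_size, (micro_batch, seq_len), device=device)
    labels = input_ids.clone()
    masked = torch.rand_like(labels, dtype=torch.float) < 0.15
    labels[~masked] = -100                          # only masked tokens contribute to the loss
    return input_ids, labels

model.train()
optimizer.zero_grad(set_to_none=True)
for step in range(accum_steps):
    input_ids, labels = random_mlm_batch()
    with torch.cuda.amp.autocast(enabled=use_amp):  # forward/backward in fp16 where safe
        logits = model(input_ids)                   # (micro_batch, seq_len, vocab_size)
        loss = loss_fn(logits.reshape(-1, vocab_size), labels.reshape(-1)) / accum_steps
    scaler.scale(loss).backward()                   # accumulate gradients across micro-batches

scaler.step(optimizer)                              # one optimizer step per large effective batch
scaler.update()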

Related research

CMV-BERT: Contrastive multi-vocab pretraining of BERT (12/29/2020)
Multi-node Bert-pretraining: Cost-efficient Approach (08/01/2020)
RoBERTa: A Robustly Optimized BERT Pretraining Approach (07/26/2019)
Pretraining Without Attention (12/20/2022)
Dynamic Masking Rate Schedules for MLM Pretraining (05/24/2023)
Distributed Deep Learning in Open Collaborations (06/18/2021)
Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer (03/07/2022)