EarlyBERT: Efficient BERT Training via Early-bird Lottery Tickets

12/31/2020
by Xiaohan Chen, et al.

Deep, heavily overparameterized language models such as BERT, XLNet and T5 have achieved impressive success in many NLP tasks. However, their high model complexity requires enormous computational resources and extremely long training time for both pre-training and fine-tuning. Many works have studied model compression for large NLP models, but they focus only on reducing inference cost/time, while still requiring an expensive training process. Other works use extremely large batch sizes to shorten the pre-training time, at the expense of a high demand for computational resources. In this paper, inspired by the Early-Bird Lottery Tickets studied for computer vision tasks, we propose EarlyBERT, a general computationally-efficient training algorithm applicable to both pre-training and fine-tuning of large-scale language models. We are the first to identify structured winning tickets in the early stage of BERT training, and use them for efficient training. Comprehensive pre-training and fine-tuning experiments on GLUE and SQuAD downstream tasks show that EarlyBERT easily achieves comparable performance to standard BERT with 35-45% less training time.
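
The abstract describes a two-stage idea: briefly train while learning importance coefficients for structured units (e.g., attention heads), watch the resulting pruning mask until it stops changing, then prune and spend the remaining budget training the slimmed model. Below is a minimal sketch of that search loop, assuming a PyTorch setup; the class and variable names (EarlyBirdSearch, head_coefs, keep_ratio) are illustrative and not the authors' released implementation.

```python
# Hypothetical sketch of an early-bird ticket search over attention-head
# coefficients; not the paper's official code.
import torch


class EarlyBirdSearch:
    """Tracks a structured pruning mask (here, over attention heads) during the
    early stage of training and reports when it stabilizes, i.e., when the
    "early-bird" winning ticket has emerged."""

    def __init__(self, keep_ratio=0.5, patience=3, tol=0.02):
        self.keep_ratio = keep_ratio  # fraction of heads kept after pruning
        self.patience = patience      # consecutive stable checks required
        self.tol = tol                # max Hamming distance counted as "stable"
        self.prev_mask = None
        self.stable_checks = 0

    def mask_from(self, coefs):
        # Keep the heads with the largest learned coefficient magnitudes.
        k = max(1, int(self.keep_ratio * coefs.numel()))
        idx = torch.topk(coefs.abs().flatten(), k).indices
        mask = torch.zeros(coefs.numel(), dtype=torch.bool)
        mask[idx] = True
        return mask

    def update(self, coefs):
        """Returns True once the mask has stopped changing between checks."""
        mask = self.mask_from(coefs.detach())
        if self.prev_mask is not None:
            dist = (mask ^ self.prev_mask).float().mean().item()
            self.stable_checks = self.stable_checks + 1 if dist <= self.tol else 0
        self.prev_mask = mask
        return self.stable_checks >= self.patience


# Toy usage: 12 layers x 12 heads of learnable coefficients, trained with an
# L1 penalty so unimportant heads are driven toward zero. In the real setting
# the task loss would come from the partially trained BERT model itself.
head_coefs = torch.nn.Parameter(torch.ones(12, 12))
target = torch.randn(12, 12)  # stand-in for whatever the task loss prefers
optimizer = torch.optim.Adam([head_coefs], lr=1e-2)
search = EarlyBirdSearch(keep_ratio=0.5)

for step in range(500):
    task_loss = ((head_coefs - target) ** 2).mean()
    loss = task_loss + 1e-2 * head_coefs.abs().sum()  # L1 sparsity regularizer
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if search.update(head_coefs):
        print(f"Ticket found at step {step}: prune to the mask, then train the "
              "slimmed model for the remaining budget.")
        break
```

The design point illustrated here is that the search only needs the early training steps: once the top-k mask is stable for a few consecutive checks, the structured ticket is drawn and full-cost training of the dense model can stop.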

Related research

- Visualizing and Understanding the Effectiveness of BERT (08/15/2019)
  Language model pre-training, such as BERT, has achieved remarkable resul...
- TrimBERT: Tailoring BERT for Trade-offs (02/24/2022)
  Models based on BERT have been extremely successful in solving a variety...
- reStructured Pre-training (06/22/2022)
  In this work, we try to decipher the internal connection of NLP technolo...
- MML: Maximal Multiverse Learning for Robust Fine-Tuning of Language Models (11/05/2019)
  Recent state-of-the-art language models utilize a two-phase training pro...
- How Good Are Large Language Models at Out-of-Distribution Detection? (08/20/2023)
  Out-of-distribution (OOD) detection plays a vital role in enhancing the ...
- Exploring Low-Cost Transformer Model Compression for Large-Scale Commercial Reply Suggestions (11/27/2021)
  Fine-tuning pre-trained language models improves the quality of commerci...
- Towards Understanding Label Regularization for Fine-tuning Pre-trained Language Models (05/25/2022)
  Knowledge Distillation (KD) is a prominent neural model compression tech...
