DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

10/02/2019
by Victor Sanh, et al.

As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models on the edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pre-training, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device study.
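To make the triple loss concrete, below is a minimal PyTorch sketch of how the three terms described in the abstract (masked language modeling, soft-target distillation, and cosine-distance alignment of hidden states) could be combined. The function name, loss weights, and temperature value are illustrative assumptions, not values taken from the paper.

```python
# Hypothetical sketch of a triple distillation loss: MLM + KL distillation + cosine alignment.
# The weights (alpha_*) and temperature are illustrative assumptions only.
import torch
import torch.nn.functional as F

def triple_distillation_loss(student_logits, teacher_logits,
                             student_hidden, teacher_hidden,
                             mlm_labels, temperature=2.0,
                             alpha_mlm=1.0, alpha_kd=1.0, alpha_cos=1.0):
    """Combine masked-LM, distillation (KL), and cosine-distance losses.

    student_logits, teacher_logits: (batch, seq_len, vocab_size)
    student_hidden, teacher_hidden: (batch, seq_len, hidden_size)
    mlm_labels: (batch, seq_len), with -100 at non-masked positions.
    """
    # 1) Standard masked language modeling loss on the student.
    mlm_loss = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        mlm_labels.view(-1),
        ignore_index=-100,
    )

    # 2) Distillation loss: KL divergence between temperature-softened
    #    student and teacher output distributions.
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # 3) Cosine-distance loss aligning student and teacher hidden states.
    target = torch.ones(
        student_hidden.size(0) * student_hidden.size(1),
        device=student_hidden.device,
    )
    cos_loss = F.cosine_embedding_loss(
        student_hidden.view(-1, student_hidden.size(-1)),
        teacher_hidden.view(-1, teacher_hidden.size(-1)),
        target,
    )

    return alpha_mlm * mlm_loss + alpha_kd * kd_loss + alpha_cos * cos_loss
```

In this sketch the student is trained on the teacher's softened output distribution while also being pulled toward the teacher's hidden-state directions; the relative weighting of the three terms is a tuning choice.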


Related research

07/23/2020
The Lottery Ticket Hypothesis for Pre-trained BERT Networks
In natural language processing (NLP), enormous pre-trained models like B...

04/07/2020
Towards Non-task-specific Distillation of BERT via Sentence Representation Approximation
Recently, BERT has become an essential ingredient of various NLP deep mo...

05/26/2023
A Study on Knowledge Distillation from Weak Teacher for Scaling Up Pre-trained Language Models
Distillation from Weak Teacher (DWT) is a method of transferring knowled...

04/08/2020
Poor Man's BERT: Smaller and Faster Transformer Models
The ongoing neural revolution in Natural Language Processing has recentl...

04/28/2022
RobBERTje: a Distilled Dutch BERT Model
Pre-trained large-scale language models such as BERT have gained a lot o...

03/30/2023
oBERTa: Improving Sparse Transfer Learning via improved initialization, distillation, and pruning regimes
In this paper, we introduce the range of oBERTa language models, an easy...

12/14/2021
Epigenomic language models powered by Cerebras
Large scale self-supervised pre-training of Transformer language models ...
