Reducing Transformer Depth on Demand with Structured Dropout

09/25/2019
by Angela Fan, et al.

Overparameterized transformer networks have obtained state-of-the-art results in various natural language processing tasks, such as machine translation, language modeling, and question answering. These models contain hundreds of millions of parameters, which demands a large amount of computation and makes them prone to overfitting. In this work, we explore LayerDrop, a form of structured dropout, which has a regularization effect during training and allows for efficient pruning at inference time. In particular, we show that it is possible to select sub-networks of any depth from one large network without having to finetune them and with limited impact on performance. We demonstrate the effectiveness of our approach by improving the state of the art on machine translation, language modeling, summarization, question answering, and language understanding benchmarks. Moreover, we show that our approach leads to small BERT-like models of higher quality compared to training from scratch or using distillation.
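The sketch below illustrates the structured-dropout idea described in the abstract in PyTorch-style pseudocode; it is not the authors' implementation. During training, each entire layer is skipped with some probability; at inference, a shallower sub-network is selected by simply keeping a subset of the layers, with no finetuning. The class name LayerDropEncoder, the drop_rate value, and the keep-every-other-layer pruning rule are illustrative assumptions.

```python
import torch
import torch.nn as nn


class LayerDropEncoder(nn.Module):
    """Stack of transformer layers with structured dropout over whole layers.

    Training: each layer is skipped with probability `drop_rate`.
    Inference: a shallower sub-network is selected by keeping only
    every `keep_every`-th layer (keep_every=1 uses the full stack).
    """

    def __init__(self, layers: nn.ModuleList, drop_rate: float = 0.2):
        super().__init__()
        self.layers = layers
        self.drop_rate = drop_rate

    def forward(self, x: torch.Tensor, keep_every: int = 1) -> torch.Tensor:
        for i, layer in enumerate(self.layers):
            if self.training:
                # Drop the entire layer with probability drop_rate.
                if torch.rand(1).item() < self.drop_rate:
                    continue
            else:
                # Prune on demand: keep only every keep_every-th layer.
                if i % keep_every != 0:
                    continue
            x = layer(x)
        return x


# Usage sketch: 12 layers trained with drop_rate=0.2; at inference,
# keep_every=2 yields a 6-layer sub-network without finetuning.
layers = nn.ModuleList([
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
    for _ in range(12)
])
model = LayerDropEncoder(layers, drop_rate=0.2)
x = torch.randn(2, 16, 512)  # (batch, sequence, d_model)

model.train()
out_full = model(x)  # random layers skipped on each forward pass

model.eval()
with torch.no_grad():
    out_pruned = model(x, keep_every=2)  # pruned 6-layer sub-network
```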


Related research

12/03/2021 - Multitask Finetuning for Improving Neural Machine Translation in Indian Languages
03/16/2020 - TRANS-BLSTM: Transformer with Bidirectional LSTM for Language Understanding
04/24/2020 - Lite Transformer with Long-Short Range Attention
06/10/2019 - Improving Neural Language Modeling via Adversarial Training
03/05/2020 - Talking-Heads Attention
04/11/2021 - UniDrop: A Simple yet Effective Technique to Improve Transformer without Extra Cost
06/25/2022 - PLATON: Pruning Large Transformer Models with Upper Confidence Bound of Weight Importance
