LV-BERT: Exploiting Layer Variety for BERT

06/22/2021
by Weihao Yu, et al.

Modern pre-trained language models are mostly built upon backbones that stack self-attention and feed-forward layers in an interleaved order. In this paper, beyond this stereotyped layer pattern, we aim to improve pre-trained models by exploiting layer variety from two aspects: the layer type set and the layer order. Specifically, besides the original self-attention and feed-forward layers, we introduce convolution into the layer type set, which is experimentally found to be beneficial to pre-trained models. Furthermore, beyond the original interleaved order, we explore more layer orders to discover more powerful architectures. However, the introduced layer variety leads to an architecture space of billions of candidates, while training a single candidate model from scratch already requires huge computational cost, making it unaffordable to search such a space by directly training large numbers of candidate models. To solve this problem, we first pre-train a supernet from which the weights of all candidate models can be inherited, and then adopt an evolutionary algorithm guided by pre-training accuracy to find the optimal architecture. Extensive experiments show that the LV-BERT models obtained by our method outperform BERT and its variants on various downstream tasks. For example, LV-BERT-small achieves 79.8 on the GLUE test set, 1.8 points higher than the strong baseline ELECTRA-small.
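As a rough illustration of the search procedure described in the abstract, the sketch below encodes a candidate architecture as a sequence of layer types (self-attention, feed-forward, convolution) and evolves a population guided by a scoring function. All names and hyper-parameters here (LAYER_TYPES, NUM_LAYERS, proxy_accuracy, population size, mutation rate) are illustrative assumptions, not the authors' implementation; in particular, proxy_accuracy is a stand-in for evaluating a candidate whose weights are inherited from the pre-trained supernet.

```python
import random

# Assumed layer type set from the abstract: self-attention (SA),
# feed-forward (FF), and the newly introduced convolution (CONV).
LAYER_TYPES = ["SA", "FF", "CONV"]
NUM_LAYERS = 12          # assumed depth for a small model
POPULATION_SIZE = 20     # illustrative evolutionary-search hyper-parameters
NUM_GENERATIONS = 10
MUTATION_PROB = 0.2


def random_architecture():
    """Sample a candidate layer order from the enlarged search space."""
    return [random.choice(LAYER_TYPES) for _ in range(NUM_LAYERS)]


def proxy_accuracy(architecture):
    """Placeholder for the pre-training accuracy of a candidate with
    supernet-inherited weights; a dummy score keeps the sketch runnable."""
    return random.random()


def mutate(architecture):
    """Randomly change some layer types to produce a child candidate."""
    return [
        random.choice(LAYER_TYPES) if random.random() < MUTATION_PROB else layer
        for layer in architecture
    ]


def evolutionary_search():
    """Toy evolutionary loop guided by (proxy) pre-training accuracy."""
    population = [random_architecture() for _ in range(POPULATION_SIZE)]
    for _ in range(NUM_GENERATIONS):
        ranked = sorted(population, key=proxy_accuracy, reverse=True)
        parents = ranked[: POPULATION_SIZE // 2]  # keep the fittest half
        children = [mutate(random.choice(parents)) for _ in parents]
        population = parents + children
    return max(population, key=proxy_accuracy)


if __name__ == "__main__":
    print("Best layer order found:", evolutionary_search())
```

In the actual method, the dummy proxy would be replaced by measuring pre-training accuracy of each candidate using weights inherited from the supernet, which avoids training every candidate from scratch.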


Related research

- ConvBERT: Improving BERT with Span-based Dynamic Convolution (08/06/2020)
- AutoBERT-Zero: Evolving BERT Backbone from Scratch (07/15/2021)
- LRG at SemEval-2020 Task 7: Assessing the Ability of BERT and Derivative Models to Perform Short-Edits based Humor Grading (05/31/2020)
- ResMLP: Feedforward networks for image classification with data-efficient training (05/07/2021)
- SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text Joint Pre-Training (10/20/2021)
- Memory-efficient Transformers via Top-k Attention (06/13/2021)
- CoRe: An Efficient Coarse-refined Training Framework for BERT (11/27/2020)
