MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation

04/15/2022
by Simiao Zuo, et al.

Pre-trained language models have demonstrated superior performance in various natural language processing tasks. However, these models usually contain hundreds of millions of parameters, which limits their practicality because of latency requirements in real-world applications. Existing methods train small compressed models via knowledge distillation, but the performance of these small models drops significantly compared with the pre-trained models due to their reduced capacity. We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed. We initialize MoEBERT by adapting the feed-forward neural networks in a pre-trained model into multiple experts. As such, the representation power of the pre-trained model is largely retained. During inference, only one of the experts is activated, which improves speed. We also propose a layer-wise distillation method to train MoEBERT. We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks. Results show that the proposed method outperforms existing task-specific distillation algorithms; for example, our method outperforms previous approaches by over 2%. Our code is publicly available at https://github.com/SimiaoZuo/MoEBERT.
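A minimal PyTorch sketch of the mechanism described above, assuming an importance-guided split of a pre-trained feed-forward network (FFN) into experts and top-1 token routing; this is not the authors' released implementation. The class name ImportanceGuidedMoEFFN, the parameters num_experts and shared_top_k, and the random importance scores in the usage example are illustrative assumptions, and the layer-wise distillation loss is omitted.

```python
import torch
import torch.nn as nn


class ImportanceGuidedMoEFFN(nn.Module):
    """Sketch: split a pre-trained FFN into experts by neuron importance."""

    def __init__(self, ffn_in: nn.Linear, ffn_out: nn.Linear,
                 importance: torch.Tensor, num_experts: int = 4,
                 shared_top_k: int = 512):
        super().__init__()
        hidden = ffn_in.out_features     # intermediate size, e.g. 3072 in BERT-base
        d_model = ffn_in.in_features     # model width, e.g. 768
        assert importance.numel() == hidden

        # Rank intermediate neurons by an importance score; the most important
        # ones are shared by all experts, the rest are partitioned among them.
        order = torch.argsort(importance, descending=True)
        shared, rest = order[:shared_top_k], order[shared_top_k:]

        self.experts = nn.ModuleList()
        for chunk in rest.chunk(num_experts):
            idx = torch.cat([shared, chunk])
            expert_in = nn.Linear(d_model, idx.numel())
            expert_out = nn.Linear(idx.numel(), d_model)
            # Initialize each expert from the corresponding rows/columns of the
            # pre-trained FFN so its representation power is largely retained.
            expert_in.weight.data.copy_(ffn_in.weight.data[idx])
            expert_in.bias.data.copy_(ffn_in.bias.data[idx])
            expert_out.weight.data.copy_(ffn_out.weight.data[:, idx])
            expert_out.bias.data.copy_(ffn_out.bias.data)
            self.experts.append(nn.Sequential(expert_in, nn.GELU(), expert_out))

        # Token-level router; only the top-1 expert runs per token, so the
        # per-token FFN cost shrinks roughly to the size of one expert.
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        route = self.router(x).argmax(dim=-1)   # top-1 expert index per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = route == e
            if mask.any():
                out[mask] = expert(x[mask])
        return out


if __name__ == "__main__":
    # Stand-in for a pre-trained BERT FFN and placeholder importance scores.
    ffn_in, ffn_out = nn.Linear(768, 3072), nn.Linear(3072, 768)
    importance = torch.rand(3072)
    moe_ffn = ImportanceGuidedMoEFFN(ffn_in, ffn_out, importance)
    tokens = torch.randn(2, 16, 768)
    print(moe_ffn(tokens).shape)   # torch.Size([2, 16, 768])
```

Sharing the top-scoring neurons across experts keeps each expert close to the original FFN, while the per-token compute at inference is roughly that of a single, smaller expert.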


Related research

KroneckerBERT: Learning Kronecker Decomposition for Pre-trained Language Models via Knowledge Distillation (09/13/2021)
The development of over-parameterized pre-trained language models has ma...

PLATON: Pruning Large Transformer Models with Upper Confidence Bound of Weight Importance (06/25/2022)
Large Transformer-based models have exhibited superior performance in va...

MiniRBT: A Two-stage Distilled Small Chinese Pre-trained Model (04/03/2023)
In natural language processing, pre-trained language models have become ...

XtremeDistilTransformers: Task Transfer for Task-agnostic Distillation (06/08/2021)
While deep and large pre-trained models are the state-of-the-art for var...

Defending Pre-trained Language Models from Adversarial Word Substitutions Without Performance Sacrifice (05/30/2021)
Pre-trained contextualized language models (PrLMs) have led to strong pe...

TableQuery: Querying tabular data with natural language (01/27/2022)
This paper presents TableQuery, a novel tool for querying tabular data u...

Decoder Tuning: Efficient Language Understanding as Decoding (12/16/2022)
With the evergrowing sizes of pre-trained models (PTMs), it has been an ...
