MoEfication: Conditional Computation of Transformer Models for Efficient Inference

10/05/2021
by Zhengyan Zhang, et al.

Transformer-based pre-trained language models can achieve superior performance on most NLP tasks due to their large parameter capacity, but this capacity also leads to huge computation cost. Fortunately, we find through empirical study that most inputs activate only a small fraction of neurons during inference. Hence, we explore accelerating large-model inference via conditional computation based on this sparse-activation phenomenon. We propose to transform a large model into its mixture-of-experts (MoE) version with equal model size, namely MoEfication. MoEfication consists of two steps: (1) splitting the parameters of feed-forward networks (FFNs) into multiple parts as experts, and (2) building expert routers to decide which experts will be used for each input. To further improve the performance of MoEfied models, we can also fine-tune them on downstream tasks, a step we call parameter calibration. Experimental results show that MoEfied models can significantly reduce computation cost, e.g., by activating only 20% of FFN parameters, without performance degradation on several downstream tasks including text classification and reading comprehension.
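The two-step procedure above maps naturally onto a standard Transformer FFN of the form y = W_out(relu(W_in(x))). Below is a minimal PyTorch-style sketch, not the authors' implementation: the contiguous expert split, the learned linear router, and parameters such as num_experts and top_k are illustrative assumptions, and unselected experts are masked rather than skipped so the equivalence to the original FFN is easy to see.

```python
# Minimal sketch of the two MoEfication steps on a standard Transformer FFN,
# assuming y = W_out(relu(W_in(x))). The contiguous expert split, the learned
# linear router, and num_experts/top_k are illustrative assumptions, not the
# paper's exact construction.
import torch
import torch.nn as nn


class MoEfiedFFN(nn.Module):
    def __init__(self, w_in: nn.Linear, w_out: nn.Linear,
                 num_experts: int = 8, top_k: int = 2):
        super().__init__()
        d_ff = w_in.out_features
        assert d_ff % num_experts == 0, "intermediate size must divide evenly"
        # Step (1): split the FFN's intermediate neurons into expert groups.
        # A naive contiguous split is used here; the paper studies smarter
        # groupings based on neuron co-activation.
        self.expert_size = d_ff // num_experts
        self.num_experts = num_experts
        self.top_k = top_k
        self.w_in, self.w_out = w_in, w_out
        # Step (2): a router that scores the experts for each input token.
        self.router = nn.Linear(w_in.in_features, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        scores = self.router(x)                                # (batch, seq, E)
        top_idx = scores.topk(self.top_k, dim=-1).indices      # chosen experts
        expert_mask = torch.zeros_like(scores).scatter_(-1, top_idx, 1.0)
        # Expand the expert-level mask to a neuron-level mask over d_ff.
        neuron_mask = expert_mask.repeat_interleave(self.expert_size, dim=-1)
        # For clarity, unselected experts are zeroed out rather than skipped;
        # a real implementation would gather only the selected experts'
        # weights to actually save computation.
        h = torch.relu(self.w_in(x)) * neuron_mask
        return self.w_out(h)


# Example: MoEfy a BERT-base-sized FFN (d_model=768, d_ff=3072).
ffn = MoEfiedFFN(nn.Linear(768, 3072), nn.Linear(3072, 768))
out = ffn(torch.randn(4, 16, 768))   # shape (4, 16, 768)
```

With top_k = 2 of 8 experts, each token touches only a quarter of the FFN's intermediate neurons, which is the source of the computation savings reported in the abstract once the masked computation is replaced by true expert selection.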
