Task-Specific Expert Pruning for Sparse Mixture-of-Experts

06/01/2022
by Tianyu Chen, et al.

The sparse Mixture-of-Experts (MoE) model is powerful for large-scale pre-training and has achieved promising results due to its model capacity. However, with trillions of parameters, MoE is hard to deploy in cloud or mobile environments. MoE inference requires expert parallelism, which is neither hardware-friendly nor communication-efficient. Especially for resource-limited downstream tasks, such a sparse structure sacrifices a great deal of computing efficiency for limited performance gains. In this work, we observe that most experts contribute very little to MoE fine-tuning and inference. We further propose a general method that progressively drops the non-professional experts for the target downstream task, which preserves the benefits of MoE while reducing the MoE model to a single-expert dense model. Our experiments reveal that the fine-tuned single-expert model preserves 99.3% of the benefits of MoE across six different types of tasks while enjoying 2x inference speed with no communication cost.
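The abstract does not spell out the pruning criterion, but the core idea, progressively dropping experts that the downstream task rarely uses until a single expert remains, can be sketched in a few lines. The snippet below is a minimal PyTorch-style illustration and not the paper's implementation: the MoELayer class, the router-load statistic, and the prune_to method are assumptions made for clarity.

```python
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    """Simplified top-1 MoE layer with a shrinking pool of active experts."""

    def __init__(self, d_model, d_ff, num_experts):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        # Mask of experts that have not yet been pruned.
        self.register_buffer("active", torch.ones(num_experts, dtype=torch.bool))
        # Running count of tokens routed to each expert on the downstream task.
        self.register_buffer("load", torch.zeros(num_experts))

    def forward(self, x):  # x: (num_tokens, d_model)
        logits = self.router(x)
        # Pruned experts can never be selected.
        logits = logits.masked_fill(~self.active, float("-inf"))
        top1 = logits.argmax(dim=-1)
        self.load += torch.bincount(top1, minlength=len(self.experts)).float()
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            sel = top1 == i
            if sel.any():
                out[sel] = expert(x[sel])
        return out

    def prune_to(self, keep):
        """Keep only the `keep` most-used active experts; drop the rest."""
        ranked = self.load.masked_fill(~self.active, -1.0).argsort(descending=True)
        new_active = torch.zeros_like(self.active)
        new_active[ranked[:keep]] = True
        self.active = new_active
```

In such a sketch, prune_to would be called on a schedule during downstream fine-tuning, for example halving the number of active experts every few hundred steps until one expert remains; that surviving expert together with the shared Transformer layers then forms the dense model used at inference.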

Related research

On the Representation Collapse of Sparse Mixture of Experts (04/20/2022)
Sparse mixture of experts provides larger model capacity while requiring...

From Dense to Sparse: Contrastive Pruning for Better Pre-trained Language Model Compression (12/14/2021)
Pre-trained Language Models (PLMs) have achieved great success in variou...

MoEC: Mixture of Expert Clusters (07/19/2022)
Sparsely Mixture of Experts (MoE) has received great interest due to its...

MoEfication: Conditional Computation of Transformer Models for Efficient Inference (10/05/2021)
Transformer-based pre-trained language models can achieve superior perfo...

Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers (03/02/2023)
Despite their remarkable achievement, gigantic transformers encounter si...

From Sparse to Soft Mixtures of Experts (08/02/2023)
Sparse mixture of expert architectures (MoEs) scale model capacity witho...

COAD: Contrastive Pre-training with Adversarial Fine-tuning for Zero-shot Expert Linking (12/14/2020)
Expert finding, a popular service provided by many online websites such ...
