Residual Mixture of Experts

04/20/2022
by Lemeng Wu, et al.

Mixture of Experts (MoE) is able to scale up vision transformers effectively. However, it requires prohibitive computation resources to train a large MoE transformer. In this paper, we propose Residual Mixture of Experts (RMoE), an efficient training pipeline for MoE vision transformers on downstream tasks, such as segmentation and detection. RMoE achieves comparable results with the upper-bound MoE training, while introducing only minor additional training cost compared with the lower-bound non-MoE training pipelines. The efficiency is supported by our key observation: the weights of an MoE transformer can be factored into an input-independent core and an input-dependent residual. Compared with the weight core, the weight residual can be trained efficiently with much less computation, e.g., by finetuning on the downstream data. We show that, compared with the current MoE training pipeline, we get comparable results while saving over 30% training cost. Compared with state-of-the-art non-MoE transformers, such as Swin-T / CvT-13 / Swin-L, we get +1.1 / 0.9 / 1.0 mIoU gain on ADE20K segmentation and +1.4 / 1.6 / 0.6 AP gain on the MS-COCO object detection task with less than 3% additional training cost.
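
The core-plus-residual factorization can be pictured with a short sketch. The code below is illustrative only: it assumes a top-1-routed MoE feed-forward layer, and the class and parameter names (ResidualMoEFFN, residual_w, residual_b) are hypothetical, not taken from the paper's implementation.

```python
import torch
import torch.nn as nn

class ResidualMoEFFN(nn.Module):
    """Illustrative sketch of the core-plus-residual weight factorization:
    all experts share an input-independent weight core, and each expert only
    adds a small input-dependent residual on top of it. Not the authors' code."""

    def __init__(self, dim, hidden_dim, num_experts):
        super().__init__()
        # Input-independent core, shared across experts; it can be kept frozen
        # so that only the residuals are trained on the downstream task.
        self.core = nn.Linear(dim, hidden_dim)
        # Per-expert residuals, initialized at zero so the layer starts out
        # behaving exactly like the dense (non-MoE) core.
        self.residual_w = nn.Parameter(torch.zeros(num_experts, hidden_dim, dim))
        self.residual_b = nn.Parameter(torch.zeros(num_experts, hidden_dim))
        self.router = nn.Linear(dim, num_experts)
        self.proj = nn.Linear(hidden_dim, dim)

    def forward(self, x):
        # x: (num_tokens, dim); top-1 routing of each token to one expert.
        expert_idx = self.router(x).argmax(dim=-1)             # (num_tokens,)
        hidden = self.core(x)                                   # shared core term
        w = self.residual_w[expert_idx]                         # (num_tokens, hidden_dim, dim)
        b = self.residual_b[expert_idx]                         # (num_tokens, hidden_dim)
        hidden = hidden + torch.einsum('thd,td->th', w, x) + b  # expert residual term
        return self.proj(torch.relu(hidden))

# Efficient downstream training in the spirit of the abstract: freeze the core
# and train only the cheap residuals plus the router on the downstream data.
layer = ResidualMoEFFN(dim=96, hidden_dim=384, num_experts=8)
layer.core.requires_grad_(False)
```

Because the residuals start at zero, the layer initially reproduces the dense non-MoE behavior, and only the small per-expert residual tensors need gradients during downstream finetuning, which is where the claimed training-cost savings come from.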

