Experts Weights Averaging: A New General Training Scheme for Vision Transformers

08/11/2023
by   Yongqi Huang, et al.

Structural re-parameterization is a general training scheme for Convolutional Neural Networks (CNNs) that achieves performance improvement without increasing inference cost. As Vision Transformers (ViTs) gradually surpass CNNs in various visual tasks, a natural question arises: does a training scheme specifically for ViTs exist that can also achieve performance improvement without increasing inference cost? Recently, Mixture-of-Experts (MoE) has attracted increasing attention, as it can efficiently scale up the capacity of Transformers at a fixed cost through sparsely activated experts. Considering that MoE can also be viewed as a multi-branch structure, can we utilize MoE to implement a ViT training scheme similar to structural re-parameterization? In this paper, we affirmatively answer these questions with a new general training strategy for ViTs. Specifically, we decouple the training and inference phases of ViTs. During training, we replace some Feed-Forward Networks (FFNs) of the ViT with specially designed, more efficient MoEs that assign tokens to experts by random uniform partition, and perform Experts Weights Averaging (EWA) on these MoEs at the end of each iteration. After training, we convert each MoE into an FFN by averaging the experts, transforming the model back into the original ViT for inference. We further provide a theoretical analysis to show why and how it works. Comprehensive experiments across various 2D and 3D visual tasks, ViT architectures, and datasets validate the effectiveness and generalizability of the proposed training scheme. Besides, our training scheme can also be applied to improve performance when fine-tuning ViTs. Last but not least, the proposed EWA technique can significantly improve the effectiveness of naive MoE on various small 2D visual datasets and 3D visual tasks.
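The abstract describes three mechanical pieces: an MoE layer that routes tokens to experts by random uniform partition, an Experts Weights Averaging (EWA) step applied at the end of each training iteration, and a post-training conversion that averages the experts back into a single FFN. The sketch below illustrates one plausible PyTorch rendering of these pieces; it is not the authors' released implementation. The class and function names, the EWA coefficient `beta`, and the exact averaging rule (an exponential-moving-average-style pull of each expert toward the experts' mean) are assumptions for illustration only.

```python
import copy
import torch
import torch.nn as nn


class RandomPartitionMoE(nn.Module):
    """MoE layer that replaces an FFN during training.

    Tokens are assigned to experts by a random uniform partition
    (no learned router), as described in the abstract.
    """

    def __init__(self, ffn: nn.Module, num_experts: int = 4):
        super().__init__()
        # Each expert starts as a copy of the original FFN (assumed initialization).
        self.experts = nn.ModuleList(copy.deepcopy(ffn) for _ in range(num_experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) -> flatten tokens before partitioning.
        b, t, d = x.shape
        flat = x.reshape(b * t, d)
        out = torch.empty_like(flat)
        # Random uniform partition of token indices across experts.
        perm = torch.randperm(flat.size(0), device=flat.device)
        for expert, idx in zip(self.experts, perm.chunk(len(self.experts))):
            out[idx] = expert(flat[idx])
        return out.reshape(b, t, d)


@torch.no_grad()
def experts_weights_averaging(moe: RandomPartitionMoE, beta: float = 0.99) -> None:
    """EWA step applied at the end of each training iteration:
    pull every expert's parameters toward the experts' mean.
    The update rule and `beta` are illustrative assumptions."""
    params = [list(e.parameters()) for e in moe.experts]
    for per_expert in zip(*params):
        mean = torch.stack([p.data for p in per_expert]).mean(dim=0)
        for p in per_expert:
            p.data.mul_(beta).add_(mean, alpha=1.0 - beta)


@torch.no_grad()
def convert_to_ffn(moe: RandomPartitionMoE) -> nn.Module:
    """After training, collapse the MoE back into a single FFN by
    averaging the experts' weights, so inference cost is unchanged."""
    ffn = copy.deepcopy(moe.experts[0])
    all_params = [list(e.parameters()) for e in moe.experts]
    for target, per_expert in zip(ffn.parameters(), zip(*all_params)):
        target.data.copy_(torch.stack([p.data for p in per_expert]).mean(dim=0))
    return ffn
```

In a training loop under these assumptions, one would wrap selected FFN blocks with `RandomPartitionMoE`, call `experts_weights_averaging(moe)` after each `optimizer.step()`, and call `convert_to_ffn(moe)` once after training to recover a standard ViT for inference.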


