Vision Transformer Architecture Search

06/25/2021
by Xiu Su, et al.

Recently, transformers have shown great superiority in solving computer vision tasks by modeling images as sequences of manually split patches with a self-attention mechanism. However, current vision transformer (ViT) architectures are simply inherited from natural language processing (NLP) tasks and have not been sufficiently investigated or optimized. In this paper, we take a further step by examining the intrinsic structure of transformers for vision tasks and propose an architecture search method, dubbed ViTAS, to search for the optimal architecture under comparable hardware budgets. Concretely, we design an effective yet efficient weight-sharing paradigm for ViTs, such that architectures with different token embeddings, sequence sizes, numbers of heads, widths, and depths can all be derived from a single super-transformer. Moreover, to accommodate the variance among distinct architectures, we introduce a private class token and private self-attention maps in the super-transformer. In addition, to adapt the search to different budgets, we propose to search the sampling probability of the identity operation. Experimental results show that ViTAS attains excellent results compared to existing pure-transformer architectures. For example, under a 1.3G FLOPs budget, our searched architecture achieves 74.7% top-1 accuracy on ImageNet, 2.5% higher than the current baseline ViT architecture. Code is available at <https://github.com/xiusu/ViTAS>.
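To make the weight-sharing paradigm concrete, here is a minimal PyTorch sketch of the general idea, not the authors' ViTAS implementation: the class `SuperLinear`, the helper `maybe_identity`, and all dimensions are hypothetical choices made for illustration. It shows how sub-architectures of different widths can be sliced out of a single shared parameter tensor, and how depth can vary by sampling an identity operation with some probability.

```python
import random

import torch
import torch.nn as nn
import torch.nn.functional as F


class SuperLinear(nn.Module):
    """Linear layer whose parameters are shared by every narrower sub-layer."""

    def __init__(self, max_in: int, max_out: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(max_out, max_in) * 0.02)
        self.bias = nn.Parameter(torch.zeros(max_out))

    def forward(self, x: torch.Tensor, in_dim: int, out_dim: int) -> torch.Tensor:
        # Carve a sub-layer out of the shared tensor by slicing; gradients from
        # any sampled sub-architecture flow back into the same parameters.
        return F.linear(x, self.weight[:out_dim, :in_dim], self.bias[:out_dim])


def maybe_identity(block: nn.Module, x: torch.Tensor, p_identity: float) -> torch.Tensor:
    # Depth search: with probability p_identity (a searchable quantity in the
    # paper, a plain float here) the block is replaced by the identity op.
    return x if random.random() < p_identity else block(x)


# Sample two candidate widths from one set of shared parameters.
layer = SuperLinear(max_in=768, max_out=768)
tokens = torch.randn(2, 197, 768)               # 196 patch tokens + 1 class token
wide = layer(tokens, in_dim=768, out_dim=768)                # full-width candidate
narrow = layer(tokens[..., :384], in_dim=384, out_dim=384)   # half-width candidate

# Randomly skip a block to realize a shallower candidate.
mlp = nn.Linear(768, 768)
out = maybe_identity(mlp, wide, p_identity=0.3)
```

Because every sampled sub-layer reads and writes the same underlying tensor, training any candidate updates the shared super-transformer weights, which is what lets one supernet stand in for the whole search space.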

Related research

- Searching the Search Space of Vision Transformer (11/29/2021)
  Vision Transformer has shown great visual representation power in substa...
- MDMLP: Image Classification from Scratch on Small Datasets with MLP (05/28/2022)
  The attention mechanism has become a go-to technique for natural languag...
- Training-free Transformer Architecture Search (03/23/2022)
  Recently, Vision Transformer (ViT) has achieved remarkable success in se...
- Searching for Efficient Multi-Stage Vision Transformers (09/01/2021)
  Vision Transformer (ViT) demonstrates that Transformer for natural langu...
- Memory-Efficient Differentiable Transformer Architecture Search (05/31/2021)
  Differentiable architecture search (DARTS) is successfully applied in ma...
- Which transformer architecture fits my data? A vocabulary bottleneck in self-attention (05/09/2021)
  After their successful debut in natural language processing, Transformer...
- VanillaNet: the Power of Minimalism in Deep Learning (05/22/2023)
  At the heart of foundation models is the philosophy of "more is differen...
