Incorporating Convolution Designs into Visual Transformers

03/22/2021
by Kun Yuan, et al.

Motivated by the success of Transformers in natural language processing (NLP) tasks, there have been several attempts (e.g., ViT and DeiT) to apply Transformers to the vision domain. However, pure Transformer architectures often require a large amount of training data or extra supervision to reach performance comparable to that of convolutional neural networks (CNNs). To overcome these limitations, we analyze the potential drawbacks of directly borrowing Transformer architectures from NLP. We then propose a new Convolution-enhanced image Transformer (CeiT), which combines the advantages of CNNs in extracting low-level features and strengthening locality with the advantages of Transformers in establishing long-range dependencies. Three modifications are made to the original Transformer: 1) instead of tokenizing raw input images directly, we design an Image-to-Tokens (I2T) module that extracts patches from generated low-level features; 2) the feed-forward network in each encoder block is replaced with a Locally-enhanced Feed-Forward (LeFF) layer that promotes the correlation among neighboring tokens in the spatial dimension; 3) a Layer-wise Class token Attention (LCA) is attached at the top of the Transformer to utilize the multi-level representations. Experimental results on ImageNet and seven downstream tasks show the effectiveness and generalization ability of CeiT compared with previous Transformers and state-of-the-art CNNs, without requiring a large amount of training data or extra CNN teachers. In addition, CeiT models converge better, needing 3× fewer training iterations, which can reduce the training cost significantly. Code and models will be released upon acceptance.
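
To make the second modification more concrete, the following is a minimal sketch of how neighboring tokens could be correlated in the spatial dimension. It is not the authors' implementation. It assumes, beyond what the abstract states, that the patch tokens are placed back on their 2D grid and mixed with a small convolution kernel under zero padding. The function name `local_mix`, the shared kernel, and the toy values are all hypothetical, and the class token and the linear projections of a full feed-forward layer are left out.

```python
from typing import List

Token = List[float]  # one token = a vector of channel values


def local_mix(tokens: List[Token], grid: int, kernel: List[List[float]]) -> List[Token]:
    """Mix each patch token with its spatial neighbours, channel by channel.

    tokens: grid*grid patch tokens in row-major order (class token excluded).
    kernel: k x k weights shared by all channels. A learned depthwise
            convolution would hold one kernel per channel; sharing a single
            kernel is a simplification made for this sketch.
    """
    if len(tokens) != grid * grid:
        raise ValueError("expected grid*grid patch tokens")
    k = len(kernel)
    pad = k // 2
    channels = len(tokens[0])
    out: List[Token] = []
    for row in range(grid):
        for col in range(grid):
            mixed = [0.0] * channels
            for dr in range(k):
                for dc in range(k):
                    r, c = row + dr - pad, col + dc - pad
                    # Positions outside the grid contribute nothing (zero padding).
                    if 0 <= r < grid and 0 <= c < grid:
                        weight = kernel[dr][dc]
                        neighbour = tokens[r * grid + c]
                        for ch in range(channels):
                            mixed[ch] += weight * neighbour[ch]
            out.append(mixed)
    return out


# Toy input: a 3x3 grid of single-channel tokens where only the centre is non-zero.
tokens = [[0.0] for _ in range(9)]
tokens[4] = [9.0]
box = [[1 / 9] * 3 for _ in range(3)]  # hypothetical averaging kernel
mixed = local_mix(tokens, grid=3, kernel=box)
```

In this toy case every output token picks up part of the centre token's value, which a purely token-wise feed-forward layer could not do because it processes each token independently.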

research
03/29/2021

On the Adversarial Robustness of Visual Transformers

Following the success in advancing natural language processing and under...
research
05/17/2021

Rethinking the Design Principles of Robust Vision Transformer

Recent advances on Vision Transformers (ViT) have shown that self-attent...
research
06/21/2022

Vicinity Vision Transformer

Vision transformers have shown great success on numerous computer vision...
research
05/28/2021

KVT: k-NN Attention for Boosting Vision Transformers

Convolutional Neural Networks (CNNs) have dominated computer vision for ...
research
06/02/2021

Container: Context Aggregation Network

Convolutional neural networks (CNNs) are ubiquitous in computer vision, ...
research
08/30/2021

Exploring and Improving Mobile Level Vision Transformers

We study the vision transformer structure in the mobile level in this pa...
research
04/16/2022

Towards Lightweight Transformer via Group-wise Transformation for Vision-and-Language Tasks

Despite the exciting performance, Transformer is criticized for its exce...
