So-ViT: Mind Visual Tokens for Vision Transformer

04/22/2021
by Jiangtao Xie, et al.

Recently, the vision transformer (ViT) architecture, whose backbone consists purely of self-attention, has achieved very promising performance on visual classification. However, the high performance of the original ViT depends heavily on pretraining with ultra-large-scale datasets, and it significantly underperforms on ImageNet-1K when trained from scratch. This paper addresses this problem by carefully considering the role of visual tokens. First, for the classification head, the existing ViT exploits only the class token while entirely neglecting the rich semantic information inherent in the high-level visual tokens. We therefore propose a new classification paradigm in which second-order, cross-covariance pooling of the visual tokens is combined with the class token for final classification. In addition, a fast singular value power normalization is proposed to improve the second-order pooling. Second, the original ViT employs a naive embedding of fixed-size image patches and therefore lacks the ability to model translation equivariance and locality. To alleviate this problem, we develop a lightweight, hierarchical module based on off-the-shelf convolutions for visual token embedding. The proposed architecture, which we call So-ViT, is thoroughly evaluated on ImageNet-1K. The results show that our models, when trained from scratch, outperform the competing ViT variants while being on par with or better than state-of-the-art CNN models. Code is available at https://github.com/jiangtaoxie/So-ViT
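
To make the two ideas above concrete, here are two minimal sketches in PyTorch. They are illustrative only: the projection sizes, the signed square root used in place of the paper's fast singular value power normalization, and the simple addition of the two logit vectors are assumptions made for exposition, not the exact So-ViT design (see the official code linked above for that). The first sketch shows a classification head that fuses the class-token prediction with second-order, cross-covariance pooling of the visual tokens.

```python
import torch
import torch.nn as nn


class SecondOrderHead(nn.Module):
    """Sketch of a ViT head that combines the class token with second-order
    (cross-covariance) pooling of the visual tokens. Hyper-parameters and the
    normalization are illustrative assumptions, not the official So-ViT ones."""

    def __init__(self, embed_dim: int, num_classes: int, d1: int = 64, d2: int = 64):
        super().__init__()
        # Two low-dimensional projections whose cross-covariance summarizes
        # the statistics of all visual tokens.
        self.proj_a = nn.Linear(embed_dim, d1)
        self.proj_b = nn.Linear(embed_dim, d2)
        self.visual_fc = nn.Linear(d1 * d2, num_classes)  # classifier on pooled tokens
        self.cls_fc = nn.Linear(embed_dim, num_classes)   # classifier on the class token

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, 1 + n, embed_dim), with the class token at index 0.
        cls_tok, vis_tok = tokens[:, 0], tokens[:, 1:]
        a = self.proj_a(vis_tok)                               # (batch, n, d1)
        b = self.proj_b(vis_tok)                               # (batch, n, d2)
        # Cross-covariance pooling over the token dimension.
        cov = torch.einsum("bni,bnj->bij", a, b) / a.shape[1]  # (batch, d1, d2)
        # Signed square root: a crude stand-in for the fast singular value
        # power normalization described in the abstract.
        cov = torch.sign(cov) * torch.sqrt(cov.abs() + 1e-6)
        vis_logits = self.visual_fc(cov.flatten(1))
        cls_logits = self.cls_fc(cls_tok)
        return cls_logits + vis_logits  # fuse the two predictions


# Example: 196 patch tokens plus one class token from a ViT of width 384.
head = SecondOrderHead(embed_dim=384, num_classes=1000)
logits = head(torch.randn(2, 197, 384))  # -> shape (2, 1000)
```

The second sketch illustrates the other contribution: replacing the naive fixed-size patch embedding with a lightweight, hierarchical stack of ordinary convolutions. Again, the kernel sizes, strides, and channel widths below are assumptions chosen only so that the stem produces the usual 16x16 token grid; they are not the So-ViT configuration.

```python
class ConvTokenEmbedding(nn.Module):
    """Sketch of a hierarchical convolutional stem for visual token embedding.
    Layer shapes are illustrative assumptions, not the So-ViT configuration.
    (Uses the torch / nn imports from the previous sketch.)"""

    def __init__(self, embed_dim: int = 384):
        super().__init__()
        self.stem = nn.Sequential(
            # Four stride-2 stages give 16x downsampling overall, i.e. one
            # token per 16x16 input region, matching a standard ViT patch grid.
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, embed_dim, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(embed_dim),
            nn.ReLU(inplace=True),
            nn.Conv2d(embed_dim, embed_dim, kernel_size=2, stride=2),
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (batch, 3, H, W) -> tokens: (batch, (H/16) * (W/16), embed_dim)
        feat = self.stem(images)
        return feat.flatten(2).transpose(1, 2)


# Example: a 224x224 image yields 14 * 14 = 196 visual tokens of width 384.
tokens = ConvTokenEmbedding()(torch.randn(2, 3, 224, 224))  # -> (2, 196, 384)
```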
