Scalable Visual Transformers with Hierarchical Pooling

by Zizheng Pan, et al.

The recently proposed Visual image Transformers (ViT) with pure attention have achieved promising performance on image recognition tasks such as image classification. However, current ViT models maintain a full-length patch sequence throughout inference, which is redundant and lacks hierarchical representation. To this end, we propose a Hierarchical Visual Transformer (HVT), which progressively pools visual tokens to shrink the sequence length and hence reduce the computational cost, analogous to feature map downsampling in Convolutional Neural Networks (CNNs). A major benefit is that, because of the reduced sequence length, we can increase model capacity by scaling depth, width, resolution, or patch size without introducing extra computational complexity. Moreover, we empirically find that the average-pooled visual tokens contain more discriminative information than the single class token. To demonstrate the improved scalability of HVT, we conduct extensive experiments on image classification. With comparable FLOPs, HVT outperforms competitive baselines on the ImageNet and CIFAR-100 datasets.
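
The token pooling idea can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the use of max pooling, the kernel size of 3, the stride of 2, and the helper names `pool_tokens`, `average_tokens`, and `attention_cost` are all assumptions made for illustration, and the cost formula is only the usual rough multiply-add estimate for one self-attention layer.

```python
# Illustrative sketch only. The pooling operator (max), kernel size, stride,
# and all function names are assumptions, not the configuration of the paper.

def pool_tokens(tokens, kernel=3, stride=2):
    """1D max pooling along the token axis, applied independently per channel.

    tokens: list of N token vectors, each a list of D floats.
    Returns a shorter list of pooled token vectors (about N / stride of them).
    """
    pad = kernel // 2
    n = len(tokens)
    dim = len(tokens[0])
    pooled = []
    for center in range(0, n, stride):
        # Clip the window at the sequence boundaries instead of padding.
        lo = max(0, center - pad)
        hi = min(n, center + pad + 1)
        window = tokens[lo:hi]
        pooled.append([max(tok[c] for tok in window) for c in range(dim)])
    return pooled


def average_tokens(tokens):
    """Average all tokens into one vector, used here in place of a class token."""
    n = len(tokens)
    dim = len(tokens[0])
    return [sum(tok[c] for tok in tokens) / n for c in range(dim)]


def attention_cost(n, d):
    """Rough multiply-add count for one self-attention layer.

    4*n*d*d covers the query, key, value, and output projections;
    2*n*n*d covers the attention matrix and its product with the values.
    """
    return 4 * n * d * d + 2 * n * n * d


if __name__ == "__main__":
    # 196 tokens corresponds to a 14 x 14 patch grid; the values are dummy data.
    tokens = [[float(i), float(-i)] for i in range(196)]
    stage1 = pool_tokens(tokens)
    stage2 = pool_tokens(stage1)
    print(len(tokens), len(stage1), len(stage2))
    print(attention_cost(196, 384), attention_cost(98, 384))
    print(average_tokens(stage2))
```

Because the attention term grows quadratically with sequence length, each pooling stage in this sketch more than halves the estimated per-layer attention cost, which is the budget that could then be spent on extra depth, width, or resolution.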

