Scalable Visual Transformers with Hierarchical Pooling

03/19/2021
by   Zizheng Pan, et al.

The recently proposed Visual image Transformers (ViT) with pure attention have achieved promising performance on image recognition tasks such as image classification. However, current ViT models maintain a full-length patch sequence during inference, which is redundant and lacks hierarchical representation. To this end, we propose a Hierarchical Visual Transformer (HVT) that progressively pools visual tokens to shrink the sequence length and hence reduce the computational cost, analogous to feature map downsampling in Convolutional Neural Networks (CNNs). A major benefit is that, because of the reduced sequence length, we can increase model capacity by scaling depth, width, resolution, or patch size without introducing extra computational complexity. Moreover, we empirically find that the average-pooled visual tokens contain more discriminative information than the single class token. To demonstrate the improved scalability of HVT, we conduct extensive experiments on image classification. With comparable FLOPs, HVT outperforms competitive baselines on the ImageNet and CIFAR-100 datasets.
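The token-pooling idea in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. The abstract does not state the pooling operator, the kernel size, or the number of stages, so the max pooling, the kernel and stride of 2, the two pooling stages, and the rough attention cost formula below are all assumptions, and the function names are hypothetical.

```python
def pool_tokens(tokens, kernel=2, stride=2):
    """Max-pool token vectors along the sequence axis (assumed operator)."""
    pooled = []
    for start in range(0, len(tokens) - kernel + 1, stride):
        window = tokens[start:start + kernel]
        # Element-wise maximum over the tokens in this window.
        pooled.append([max(column) for column in zip(*window)])
    return pooled


def average_tokens(tokens):
    """Average all remaining tokens into one vector for classification."""
    count = len(tokens)
    return [sum(column) / count for column in zip(*tokens)]


def attention_cost(num_tokens, dim):
    """Rough multiply-add count for QK^T plus attention-times-V (assumed model)."""
    return 2 * num_tokens ** 2 * dim


# Toy input: 196 tokens (a 14 x 14 patch grid) with embedding dimension 2.
tokens = [[float(i), float(-i)] for i in range(196)]
lengths = [len(tokens)]
for _ in range(2):  # two pooling stages, chosen only for illustration
    tokens = pool_tokens(tokens)
    lengths.append(len(tokens))

feature = average_tokens(tokens)  # used in place of a single class token
print(lengths)
print(attention_cost(lengths[0], 64) // attention_cost(lengths[-1], 64))
```

Because the self-attention cost in this rough model grows with the square of the sequence length, each halving of the token count cuts that term by about a factor of four. That saving is the budget the abstract proposes to spend on extra depth, width, resolution, or smaller patches.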

research
01/28/2021

Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

Transformers, which are popular for language modeling, have been explore...
research
11/19/2018

Reducing Visual Confusion with Discriminative Attention

Recent developments in gradient-based attention modeling have led to imp...
research
04/13/2023

TransHP: Image Classification with Hierarchical Prompting

This paper explores a hierarchical prompting mechanism for the hierarchi...
research
06/19/2023

RaViTT: Random Vision Transformer Tokens

Vision Transformers (ViTs) have successfully been applied to image class...
research
05/17/2023

CageViT: Convolutional Activation Guided Efficient Vision Transformer

Recently, Transformers have emerged as the go-to architecture for both v...
research
10/01/2022

CAST: Concurrent Recognition and Segmentation with Adaptive Segment Tokens

Recognizing an image and segmenting it into coherent regions are often t...
research
12/08/2022

Group Generalized Mean Pooling for Vision Transformer

Vision Transformer (ViT) extracts the final representation from either c...
