Aggregating Nested Transformers

05/26/2021
by Zizhao Zhang, et al.

Although hierarchical structures are popular in recent vision transformers, they require sophisticated designs and massive datasets to work well. In this work, we explore the idea of nesting basic local transformers on non-overlapping image blocks and aggregating them in a hierarchical manner. We find that the block aggregation function plays a critical role in enabling cross-block non-local information communication. This observation leads us to design a simplified architecture that requires only minor code changes upon the original vision transformer and obtains improved performance compared to existing methods. Our empirical results show that the proposed method NesT converges faster and requires much less training data to achieve good generalization. For example, a NesT with 68M parameters trained on ImageNet for 100/300 epochs achieves 82.3%/83.8% accuracy evaluated on 224×224 image size, outperforming previous methods with up to 57% parameter reduction. Training a NesT with 6M parameters from scratch on CIFAR10 achieves 96% accuracy using a single GPU, setting a new state of the art for vision transformers. Beyond image classification, we extend the key idea to image generation and show NesT leads to a strong decoder that is 8× faster than previous transformer-based generators. Furthermore, we also propose a novel method for visually interpreting the learned model.
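The abstract's core mechanism, local self-attention inside non-overlapping blocks followed by a block aggregation step that mixes information across blocks, can be sketched in a few lines. The PyTorch snippet below is a minimal illustration under stated assumptions, not the authors' implementation: the names (partition_blocks, merge_blocks, NesTStage) and the exact conv-plus-max-pool aggregation are choices made for exposition, and details such as positional embeddings are omitted.

```python
# Hedged sketch of the NesT idea: attention stays inside non-overlapping
# blocks; a convolutional aggregation step downsamples and lets information
# flow across block boundaries. Illustrative only, not the official code.
import torch
import torch.nn as nn

def partition_blocks(x, block):
    """(B, C, H, W) -> (B * nH * nW, block*block, C): non-overlapping blocks."""
    B, C, H, W = x.shape
    x = x.view(B, C, H // block, block, W // block, block)
    x = x.permute(0, 2, 4, 3, 5, 1)  # B, nH, nW, block_h, block_w, C
    return x.reshape(-1, block * block, C)

def merge_blocks(x, block, H, W, C):
    """Inverse of partition_blocks: token blocks back to (B, C, H, W)."""
    nH, nW = H // block, W // block
    B = x.shape[0] // (nH * nW)
    x = x.view(B, nH, nW, block, block, C)
    x = x.permute(0, 5, 1, 3, 2, 4)  # B, C, nH, block_h, nW, block_w
    return x.reshape(B, C, H, W)

class NesTStage(nn.Module):
    """One hierarchy level: local transformer per block, then aggregation."""
    def __init__(self, dim, heads, block, depth=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.local = nn.TransformerEncoder(layer, depth)
        self.block = block
        # Assumed aggregation: 3x3 conv + 3x3 max-pool (stride 2). This is
        # the cross-block communication step the abstract highlights.
        self.aggregate = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1),
            nn.MaxPool2d(3, stride=2, padding=1),
        )

    def forward(self, x):                       # x: (B, C, H, W)
        B, C, H, W = x.shape
        tokens = partition_blocks(x, self.block)
        tokens = self.local(tokens)             # attention never leaves a block
        x = merge_blocks(tokens, self.block, H, W, C)
        return self.aggregate(x)                # halves H, W; mixes across blocks

x = torch.randn(2, 96, 32, 32)
y = NesTStage(dim=96, heads=3, block=8)(x)
print(y.shape)  # torch.Size([2, 96, 16, 16])
```

Because the aggregation operates on the full image plane before the next round of block partitioning, neighboring blocks exchange information even though self-attention itself is purely local.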

Related research

03/31/2021
Going deeper with Image Transformers
Transformers have been recently adapted for large scale image classifica...

03/04/2022
Characterizing Renal Structures with 3D Block Aggregate Transformers
Efficiently quantifying renal structures can provide distinct spatial co...

05/30/2022
HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling
Recently, masked image modeling (MIM) has offered a new methodology of s...

12/21/2021
Learned Queries for Efficient Local Attention
Vision Transformers (ViT) serve as powerful vision models. Unlike convol...

04/28/2022
CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers
The development of the transformer-based text-to-image models are impede...

01/30/2022
Aggregating Global Features into Local Vision Transformer
Local Transformer-based classification models have recently achieved pro...

10/10/2021
NViT: Vision Transformer Compression and Parameter Redistribution
Transformers yield state-of-the-art results across many tasks. However, ...
