P2T: Pyramid Pooling Transformer for Scene Understanding

06/22/2021
by Yu-Huan Wu, et al.

This paper jointly resolves two problems in vision transformers: i) Multi-Head Self-Attention (MHSA) has high computational and space complexity; ii) recent vision transformer networks are overly tuned for image classification, ignoring the difference between image classification (simple scenarios, more similar to NLP) and downstream scene understanding tasks (complicated scenarios, rich in structural and contextual information). To this end, we note that pyramid pooling has been demonstrated to be effective in various vision tasks owing to its powerful context abstraction, and that its natural spatial invariance is well suited to preserving structural information (problem ii)). Hence, we propose to adapt pyramid pooling to MHSA to alleviate its high demand on computational resources (problem i)). This pooling-based MHSA thus addresses both problems and is flexible and powerful for downstream scene understanding tasks. Equipped with our pooling-based MHSA, we build a downstream-task-oriented transformer network, dubbed Pyramid Pooling Transformer (P2T). Extensive experiments demonstrate that, when P2T is applied as the backbone network, it shows substantial superiority over previous CNN- and transformer-based networks in various downstream scene understanding tasks such as semantic segmentation, object detection, instance segmentation, and visual saliency detection. The code will be released at https://github.com/yuhuan-wu/P2T. Note that this technical report will be continuously updated.
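To illustrate the core idea, the sketch below shows a simplified, single-head version of pooling-based self-attention in NumPy: keys and values are taken from a small pyramid of average-pooled token grids instead of the full token set, which shrinks the attention matrix. This is only a minimal illustration under assumed simplifications (no learned projections, no multi-head split, arbitrary `pool_sizes`), not the paper's exact P-MHSA module.

```python
import numpy as np

def adaptive_avg_pool(x, out_h, out_w):
    """Average-pool a (H, W, C) feature map down to (out_h, out_w, C)."""
    H, W, C = x.shape
    hs = np.linspace(0, H, out_h + 1).astype(int)
    ws = np.linspace(0, W, out_w + 1).astype(int)
    out = np.empty((out_h, out_w, C))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[hs[i]:hs[i + 1], ws[j]:ws[j + 1]].mean(axis=(0, 1))
    return out

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def pooling_attention(x, pool_sizes=(1, 2, 4)):
    """Single-head self-attention whose keys/values come from
    pyramid-pooled tokens rather than the full token grid.

    x: (H, W, C) feature map. Returns (H*W, C) attended features.
    """
    H, W, C = x.shape
    q = x.reshape(H * W, C)                       # queries: every token
    pooled = [adaptive_avg_pool(x, s, s).reshape(s * s, C)
              for s in pool_sizes]                # pyramid of contexts
    kv = np.concatenate(pooled, axis=0)           # far fewer key/value tokens
    attn = softmax(q @ kv.T / np.sqrt(C), axis=-1)
    return attn @ kv
```

With pool sizes `(1, 2, 4)` an 8x8 grid attends to 1 + 4 + 16 = 21 pooled tokens instead of 64, so the attention cost falls from O((HW)^2 C) to O(HW * P * C), where P is the total number of pooled tokens; the pyramid simultaneously injects multi-scale context into the keys and values.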

