Inverted Pyramid Multi-task Transformer for Dense Scene Understanding

03/15/2022
by   Hanrong Ye, et al.

Multi-task dense scene understanding is a thriving research domain that requires simultaneous perception and reasoning over a set of correlated tasks with pixel-wise predictions. Most existing works suffer from a severe locality limitation because they rely heavily on convolution operations, whereas learning interactions and performing inference in a global spatial and multi-task context is critical for this problem. In this paper, we propose a novel end-to-end Inverted Pyramid multi-task (InvPT) Transformer to jointly model spatial positions and multiple tasks in a unified framework. To the best of our knowledge, this is the first work to explore a transformer structure for multi-task dense prediction in scene understanding. Moreover, while it is widely demonstrated that higher spatial resolution is remarkably beneficial for dense prediction, it is very challenging for existing transformers to go deeper at high resolutions due to the quadratic complexity of self-attention with respect to the number of spatial tokens. InvPT presents an efficient UP-Transformer block that learns multi-task feature interaction at gradually increasing resolutions, incorporating effective self-attention message passing and multi-scale feature aggregation to produce task-specific predictions at high resolution. Our method achieves superior multi-task performance on the NYUD-v2 and PASCAL-Context datasets and significantly outperforms previous state-of-the-art methods. Code and trained models will be publicly available.
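The abstract only describes the UP-Transformer block at a high level. For readers who want a concrete picture, below is a minimal PyTorch sketch of the general mechanism it names: upsample per-task features, then run joint self-attention over the concatenated spatial tokens of all tasks, with optional fusion of a same-resolution encoder feature. Every class, method, and argument name here is an illustrative assumption, not the authors' implementation; consult the released code for the actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpTransformerBlockSketch(nn.Module):
    """Toy version of the idea: upsample per-task features, then apply
    self-attention jointly over all tasks' spatial tokens (hypothetical)."""

    def __init__(self, dim: int, num_tasks: int, num_heads: int = 4):
        super().__init__()
        self.num_tasks = num_tasks
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # One attention over all tasks' tokens: message passing happens
        # across spatial positions AND across tasks in a single operation.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, task_feats, encoder_skip=None):
        # task_feats: list of per-task feature maps, each (B, C, H, W).
        # 1) Grow spatial resolution by 2x (the "inverted pyramid" step).
        task_feats = [
            F.interpolate(f, scale_factor=2, mode="bilinear", align_corners=False)
            for f in task_feats
        ]
        B, C, H, W = task_feats[0].shape
        # 2) Multi-scale aggregation: fuse a same-resolution encoder feature.
        if encoder_skip is not None:
            task_feats = [f + encoder_skip for f in task_feats]
        # 3) Flatten each map to tokens and concatenate tasks: (B, T*H*W, C).
        tokens = torch.cat(
            [f.flatten(2).transpose(1, 2) for f in task_feats], dim=1
        )
        # 4) Pre-norm self-attention + MLP with residual connections.
        h = self.norm1(tokens)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        tokens = tokens + attn_out
        tokens = tokens + self.mlp(self.norm2(tokens))
        # 5) Split back into per-task maps at the new, higher resolution.
        return [
            c.transpose(1, 2).reshape(B, C, H, W)
            for c in tokens.chunk(self.num_tasks, dim=1)
        ]

if __name__ == "__main__":
    block = UpTransformerBlockSketch(dim=64, num_tasks=2)
    feats = [torch.randn(1, 64, 8, 8) for _ in range(2)]
    outs = block(feats)
    print([tuple(o.shape) for o in outs])  # [(1, 64, 16, 16), (1, 64, 16, 16)]
```

Note that standard attention over T*H*W tokens, as in this toy version, scales quadratically with spatial size; that cost is precisely why the paper emphasizes an efficient attention design for going to higher resolutions.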


Related research

- InvPT++: Inverted Pyramid Multi-Task Transformer for Visual Scene Understanding (06/08/2023). Multi-task scene understanding aims to design models that can simultaneo...
- Contrastive Multi-Task Dense Prediction (07/16/2023). This paper targets the problem of multi-task dense prediction which aims...
- DeMT: Deformable Mixer Transformer for Multi-Task Learning of Dense Prediction (01/09/2023). Convolution neural networks (CNNs) and Transformers have their own advan...
- Prompt Guided Transformer for Multi-Task Dense Prediction (07/28/2023). Task-conditional architecture offers advantage in parameter efficiency b...
- Deformable Mixer Transformer with Gating for Multi-Task Learning of Dense Prediction (08/10/2023). CNNs and Transformers have their own advantages and both have been widel...
- PIGEON: Predicting Image Geolocations (07/11/2023). We introduce PIGEON, a multi-task end-to-end system for planet-scale ima...
- Efficient Multi-Task Scene Analysis with RGB-D Transformers (06/08/2023). Scene analysis is essential for enabling autonomous systems, such as mob...
