InvPT++: Inverted Pyramid Multi-Task Transformer for Visual Scene Understanding

06/08/2023
by Hanrong Ye, et al.

Multi-task scene understanding aims to design a single versatile model that can simultaneously predict multiple scene understanding tasks. Previous studies typically process multi-task features in a local manner, and thus cannot effectively learn spatially global and cross-task interactions, which hampers the models' ability to fully leverage the consistency of various tasks in multi-task learning. To tackle this problem, we propose an Inverted Pyramid multi-task Transformer, capable of modeling cross-task interaction among spatial features of different tasks in a global context. Specifically, we first utilize a transformer encoder to capture task-generic features for all tasks. Then, we design a transformer decoder to establish spatial and cross-task interaction globally, and a novel UP-Transformer block is devised to gradually increase the resolution of multi-task features and establish cross-task interaction at different scales. Furthermore, two types of Cross-Scale Self-Attention modules, i.e., Fusion Attention and Selective Attention, are proposed to efficiently facilitate cross-task interaction across different feature scales. An Encoder Feature Aggregation strategy is further introduced to better model multi-scale information in the decoder. Comprehensive experiments on several 2D/3D multi-task benchmarks clearly demonstrate the effectiveness of our proposal, establishing new state-of-the-art performance.
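To make the decoder idea concrete, below is a minimal PyTorch sketch of one inverted-pyramid decoder stage: each task's feature map is upsampled 2x, the per-task spatial tokens are concatenated, and a single self-attention layer lets every token attend to every other token across both space and tasks. The class name UpTransformerBlock and all shapes here are illustrative assumptions, not the authors' implementation; the actual UP-Transformer block additionally involves the Cross-Scale Self-Attention and Encoder Feature Aggregation mechanisms described in the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpTransformerBlock(nn.Module):
    """Illustrative decoder stage (hypothetical, not the paper's code):
    upsample every task's feature map 2x, then run one global
    self-attention over the concatenated tokens of all tasks, so each
    spatial location of each task can interact with all locations of
    every other task."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, feats):
        # feats: list of per-task maps, each (B, C, H, W) with identical shapes.
        up = [
            F.interpolate(f, scale_factor=2, mode="bilinear", align_corners=False)
            for f in feats
        ]
        b, c, h, w = up[0].shape
        # (B, C, 2H, 2W) -> (B, 2H*2W, C) per task, concatenated along the
        # token axis, so self-attention is both spatially global and cross-task.
        tokens = torch.cat([f.flatten(2).transpose(1, 2) for f in up], dim=1)
        x = self.norm1(tokens)
        tokens = tokens + self.attn(x, x, x, need_weights=False)[0]
        tokens = tokens + self.mlp(self.norm2(tokens))
        # Split the joint sequence back into per-task maps at the new scale.
        return [
            t.transpose(1, 2).reshape(b, c, h, w)
            for t in tokens.split(h * w, dim=1)
        ]

# Toy usage: three tasks (e.g., segmentation, depth, normals).
feats = [torch.randn(2, 64, 8, 8) for _ in range(3)]
outs = UpTransformerBlock(dim=64)(feats)
print([o.shape for o in outs])  # 3 x torch.Size([2, 64, 16, 16])
```

Note that joint attention over all T x H x W tokens is quadratic in sequence length, which is why such a stage is only tractable at moderate resolutions; the Fusion Attention and Selective Attention modules mentioned above are the paper's mechanisms for keeping cross-scale interaction efficient.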

Related research

03/15/2022 · Inverted Pyramid Multi-task Transformer for Dense Scene Understanding
Multi-task dense scene understanding is a thriving research domain that ...

09/06/2022 · Sequential Cross Attention Based Multi-task Learning
In multi-task learning (MTL) for visual scene understanding, it is cruci...

07/16/2023 · Contrastive Multi-Task Dense Prediction
This paper targets the problem of multi-task dense prediction which aims...

04/12/2022 · Medusa: Universal Feature Learning via Attentional Multitasking
Recent approaches to multi-task learning (MTL) have focused on modelling...

11/25/2022 · Aggregated Text Transformer for Scene Text Detection
This paper explores the multi-scale aggregation strategy for scene text ...

01/28/2022 · Global-Reasoned Multi-Task Learning Model for Surgical Scene Understanding
Global and local relational reasoning enable scene understanding models ...

06/23/2023 · Upscaling Global Hourly GPP with Temporal Fusion Transformer (TFT)
Reliable estimates of Gross Primary Productivity (GPP), crucial for eval...
