Efficient Multi-Task Scene Analysis with RGB-D Transformers

Scene analysis is essential for enabling autonomous systems, such as mobile robots, to operate in real-world environments. However, obtaining a comprehensive understanding of the scene requires solving multiple tasks, such as panoptic segmentation, instance orientation estimation, and scene classification. Solving these tasks given limited computing and battery capabilities on mobile platforms is challenging. To address this challenge, we introduce an efficient multi-task scene analysis approach, called EMSAFormer, that uses an RGB-D Transformer-based encoder to simultaneously perform the aforementioned tasks. Our approach builds upon the previously published EMSANet. However, we show that the dual CNN-based encoder of EMSANet can be replaced with a single Transformer-based encoder. To achieve this, we investigate how information from both RGB and depth data can be effectively incorporated in a single encoder. To accelerate inference on robotic hardware, we provide a custom NVIDIA TensorRT extension enabling highly optimization for our EMSAFormer approach. Through extensive experiments on the commonly used indoor datasets NYUv2, SUNRGB-D, and ScanNet, we show that our approach achieves state-of-the-art performance while still enabling inference with up to 39.1 FPS on an NVIDIA Jetson AGX Orin 32 GB.

READ FULL TEXT

page 1

page 3

page 4

page 7

page 9

research
10/06/2022

Robust Double-Encoder Network for RGB-D Panoptic Segmentation

Perception is crucial for robots that act in real-world environments, as...
research
11/01/2019

Centroid-Based Scene Classification (CBSC): Using Deep Features and Clustering for RGB-D Indoor Scene Classification

This paper contributes a novel method for RGB-D indoor scene classificat...
research
12/08/2022

PromptonomyViT: Multi-Task Prompt Learning Improves Video Transformers using Synthetic Scene Data

Action recognition models have achieved impressive results by incorporat...
research
09/27/2022

Towards Multimodal Multitask Scene Understanding Models for Indoor Mobile Agents

The perception system in personalized mobile agents requires developing ...
research
08/15/2019

To complete or to estimate, that is the question: A Multi-Task Approach to Depth Completion and Monocular Depth Estimation

Robust three-dimensional scene understanding is now an ever-growing area...
research
09/12/2022

Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation

Transformers have revolutionized vision and natural language processing ...
research
03/15/2022

Inverted Pyramid Multi-task Transformer for Dense Scene Understanding

Multi-task dense scene understanding is a thriving research domain that ...

Please sign up or login with your details

Forgot password? Click here to reset