3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment

08/08/2023
by Ziyu Zhu, et al.

3D vision-language grounding (3D-VL) is an emerging field that aims to connect the 3D physical world with natural language, a capability crucial for embodied intelligence. Current 3D-VL models rely heavily on sophisticated modules, auxiliary losses, and optimization tricks, motivating the need for a simple and unified model. In this paper, we propose 3D-VisTA, a pre-trained Transformer for 3D Vision and Text Alignment that can be easily adapted to various downstream tasks. 3D-VisTA simply uses self-attention layers for both single-modal modeling and multi-modal fusion, without any sophisticated task-specific design. To further enhance its performance on 3D-VL tasks, we construct ScanScribe, the first large-scale dataset of 3D scene-text pairs for 3D-VL pre-training. ScanScribe contains 2,995 RGB-D scans of 1,185 unique indoor scenes originating from the ScanNet and 3R-Scan datasets, along with 278K paired scene descriptions generated from existing 3D-VL tasks, templates, and GPT-3. 3D-VisTA is pre-trained on ScanScribe via masked language/object modeling and scene-text matching. It achieves state-of-the-art results on various 3D-VL tasks, ranging from visual grounding and dense captioning to question answering and situated reasoning. Moreover, 3D-VisTA demonstrates superior data efficiency, obtaining strong performance even with limited annotations during downstream fine-tuning.
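To make the "self-attention for fusion" idea concrete, here is a minimal sketch of joint text-object fusion: text token embeddings and 3D object embeddings are concatenated into one sequence and passed through a single self-attention layer that attends across both modalities. All dimensions, weights, and token counts are illustrative assumptions, not details from the paper; a real implementation would stack multiple layers with feed-forward blocks and learned projections.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Softmax over the key dimension (numerically stabilized).
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
d = 32                                     # hidden size (illustrative)
text_tokens = rng.normal(size=(6, d))      # e.g. 6 word-token embeddings
object_tokens = rng.normal(size=(4, d))    # e.g. 4 detected 3D-object embeddings

# Unified fusion: concatenate both modalities and let one self-attention
# layer attend across them jointly -- no task-specific cross-modal module.
w_q, w_k, w_v = (rng.normal(size=(d, d)) * d**-0.5 for _ in range(3))
fused = self_attention(np.vstack([text_tokens, object_tokens]), w_q, w_k, w_v)
print(fused.shape)  # one fused feature per input token: (10, 32)
```

The point of the sketch is the absence of any dedicated cross-attention or grounding head: both modalities live in one sequence, so a plain Transformer encoder suffices for fusion.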
