CLIP-Guided Vision-Language Pre-training for Question Answering in 3D Scenes

04/12/2023
by Maria Parelli, et al.

Training models to apply linguistic knowledge and visual concepts from 2D images to 3D world understanding is a promising direction that researchers have only recently begun to explore. In this work, we design a novel 3D vision-language pre-training method that helps a model learn semantically meaningful and transferable 3D scene point cloud representations. We inject the representational power of the popular CLIP model into our 3D encoder by aligning the encoded 3D scene features with the corresponding 2D image and text embeddings produced by CLIP. To assess our model's 3D world reasoning capability, we evaluate it on the downstream task of 3D Visual Question Answering. Quantitative and qualitative experimental results show that our pre-training method outperforms state-of-the-art approaches on this task and yields an interpretable representation of 3D scene features.
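The core pre-training signal described above, aligning 3D scene features with CLIP's image and text embeddings, is typically realized as a symmetric contrastive (InfoNCE-style) loss over matching pairs. The following is a minimal NumPy sketch of such an alignment objective; the function name, temperature value, and the assumption that 3D features and CLIP embeddings share a dimension are illustrative, not the paper's exact formulation.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Normalize each row to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def clip_alignment_loss(scene_feats, image_embeds, text_embeds, temperature=0.07):
    """Hypothetical sketch of a CLIP-guided alignment loss.

    Pulls each 3D scene feature toward its matching CLIP image and text
    embedding (diagonal pairs) while pushing it away from mismatched ones,
    averaged over the two modalities.
    """
    s = l2_normalize(scene_feats)
    idx = np.arange(len(s))
    loss = 0.0
    for targets in (image_embeds, text_embeds):
        t = l2_normalize(targets)
        logits = s @ t.T / temperature                      # (N, N) cosine similarities
        # cross-entropy with matching pairs on the diagonal
        logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        loss += -logp[idx, idx].mean()
    return loss / 2.0
```

With perfectly aligned features the matching-pair similarities dominate and the loss approaches zero; with unrelated embeddings it sits near log N, so minimizing it drives the 3D encoder toward CLIP's joint embedding space.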


Related research

06/04/2023
Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes
Training models to apply common-sense linguistic knowledge and visual co...

09/09/2022
Pre-training image-language transformers for open-vocabulary tasks
We present a pre-training approach for vision and language transformer m...

05/04/2023
VideoOFA: Two-Stage Pre-Training for Video-to-Text Generation
We propose a new two-stage pre-training framework for video-to-text gene...

09/27/2020
Unsupervised Pre-training for Biomedical Question Answering
We explore the suitability of unsupervised representation learning metho...

01/11/2023
Multimodal Inverse Cloze Task for Knowledge-based Visual Question Answering
We present a new pre-training method, Multimodal Inverse Cloze Task, for...

12/27/2021
Does CLIP Benefit Visual Question Answering in the Medical Domain as Much as it Does in the General Domain?
Contrastive Language–Image Pre-training (CLIP) has shown remarkable succ...

11/16/2021
Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts
Most existing methods in vision language pre-training rely on object-cen...
