Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes

06/04/2023
by Alexandros Delitzas, et al.

Training models to apply common-sense linguistic knowledge and visual concepts from 2D images to 3D scene understanding is a promising direction that researchers have only recently started to explore. However, it remains understudied whether 2D-distilled knowledge can provide useful representations for downstream 3D vision-language tasks such as 3D question answering. In this paper, we propose Multi-CLIP, a novel vision-language pre-training method that enables a model to learn language-grounded and transferable 3D scene point cloud representations. We leverage the representational power of CLIP by maximizing the agreement between encoded 3D scene features and the corresponding 2D multi-view image and text embeddings in the CLIP space via a contrastive objective. To validate our approach, we consider the challenging downstream tasks of 3D Visual Question Answering (3D-VQA) and 3D Situated Question Answering (3D-SQA). To this end, we develop novel multi-modal transformer-based architectures and demonstrate how our pre-training method benefits their performance. Quantitative and qualitative experimental results show that Multi-CLIP outperforms state-of-the-art methods on both 3D-VQA and 3D-SQA and yields a well-structured 3D scene feature space.
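To make the contrastive objective concrete, below is a minimal sketch of a CLIP-style symmetric InfoNCE loss that pulls encoded 3D scene features toward their matching multi-view image and text embeddings in the shared CLIP space. This is an illustrative reconstruction under stated assumptions, not the authors' released implementation; names such as MultiCLIPLoss and scene_feats are hypothetical.

import torch
import torch.nn.functional as F

class MultiCLIPLoss(torch.nn.Module):
    """Symmetric InfoNCE aligning 3D scene embeddings with CLIP targets (sketch)."""

    def __init__(self, temperature: float = 0.07):
        super().__init__()
        self.temperature = temperature

    def _info_nce(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # Matched (scene, target) pairs lie on the diagonal and act as
        # positives; every other pair in the batch is a negative.
        logits = a @ b.t() / self.temperature
        labels = torch.arange(a.size(0), device=a.device)
        return 0.5 * (F.cross_entropy(logits, labels) +
                      F.cross_entropy(logits.t(), labels))

    def forward(self, scene: torch.Tensor, image: torch.Tensor,
                text: torch.Tensor) -> torch.Tensor:
        # L2-normalize so dot products are cosine similarities in CLIP space.
        scene = F.normalize(scene, dim=-1)
        image = F.normalize(image, dim=-1)  # e.g. aggregated multi-view CLIP image embeddings
        text = F.normalize(text, dim=-1)    # CLIP text embeddings of scene descriptions
        # Agreement is maximized against both modalities jointly.
        return self._info_nce(scene, image) + self._info_nce(scene, text)

# Usage: align a batch of 3D scene features with frozen CLIP targets.
loss_fn = MultiCLIPLoss()
scene_feats = torch.randn(8, 512)  # output of a 3D point cloud encoder + projection head
image_feats = torch.randn(8, 512)  # multi-view CLIP image embeddings (frozen)
text_feats = torch.randn(8, 512)   # CLIP text embeddings (frozen)
loss = loss_fn(scene_feats, image_feats, text_feats)

In this sketch the CLIP encoders are assumed frozen, so only the 3D scene encoder is shaped by the loss, which is what lets the learned point cloud representations inherit the structure of the CLIP embedding space.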

Related research

04/12/2023 · CLIP-Guided Vision-Language Pre-training for Question Answering in 3D Scenes
Training models to apply linguistic knowledge and visual concepts from 2...

01/22/2023 · Champion Solution for the WSDM2023 Toloka VQA Challenge
In this report, we present our champion solution to the WSDM2023 Toloka ...

06/25/2021 · Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training
Vision-Language Pre-training (VLP) aims to learn multi-modal representat...

04/30/2021 · Chop Chop BERT: Visual Question Answering by Chopping VisualBERT's Heads
Vision-and-Language (VL) pre-training has shown great potential on many ...

03/22/2021 · How to Design Sample and Computationally Efficient VQA Models
In multi-modal reasoning tasks, such as visual question answering (VQA),...

03/20/2023 · 3D Concept Learning and Reasoning from Multi-View Images
Humans are able to accurately reason in 3D by gathering multi-view obser...

01/26/2022 · Learning to Compose Diversified Prompts for Image Emotion Classification
Contrastive Language-Image Pre-training (CLIP) represents the latest inc...
