Cross-Modal Contrastive Learning for Robust Reasoning in VQA

11/21/2022
by Qi Zheng, et al.

Multi-modal reasoning in visual question answering (VQA) has witnessed rapid progress recently. However, most reasoning models rely heavily on shortcuts learned from training data, which prevents their use in challenging real-world scenarios. In this paper, we propose a simple but effective cross-modal contrastive learning strategy that removes the shortcut reasoning caused by imbalanced annotations and improves overall performance. Unlike existing contrastive learning approaches, which operate with complex negative categories at the coarse (Image, Question, Answer) triplet level, we leverage the correspondences between the language and image modalities to perform finer-grained cross-modal contrastive learning. We treat each Question-Answer (QA) pair as a whole and differentiate between images that conform with it and those that contradict it. To alleviate sampling bias, we further build connected graphs among images: for each positive pair, we regard images from different graphs as negative samples and derive a multi-positive variant of contrastive learning. To the best of our knowledge, this is the first paper to show that a general contrastive learning strategy, without delicate hand-crafted rules, can contribute to robust VQA reasoning. Experiments on several mainstream VQA datasets demonstrate our superiority over the state of the art. Code is available at <https://github.com/qizhust/cmcl_vqa_pl>.
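As a rough sketch of how such a multi-positive contrastive objective could be written (this is not the authors' released code; see the repository above for the actual implementation, and note that names such as multi_positive_contrastive_loss and graph_ids are hypothetical): each QA embedding acts as an anchor, images in the same connected graph count as positives, and images from other graphs count as negatives.

```python
import torch
import torch.nn.functional as F

def multi_positive_contrastive_loss(qa_emb, img_emb, graph_ids, temperature=0.07):
    """Hypothetical sketch of a multi-positive cross-modal contrastive loss.

    qa_emb:    (B, D) embeddings of Question-Answer pairs (anchors)
    img_emb:   (B, D) embeddings of the corresponding images
    graph_ids: (B,)   id of the connected image graph each sample belongs to;
                      images sharing a graph id are positives for that anchor,
                      images from other graphs are negatives
    """
    qa_emb = F.normalize(qa_emb, dim=-1)
    img_emb = F.normalize(img_emb, dim=-1)

    # Cosine similarity between every QA anchor and every image in the batch.
    logits = qa_emb @ img_emb.t() / temperature  # (B, B)

    # positives[i, j] = 1 if image j lies in the same connected graph as anchor i.
    positives = (graph_ids.unsqueeze(0) == graph_ids.unsqueeze(1)).float()

    # Multi-positive InfoNCE: average the log-probability over all positives
    # of each anchor instead of using a single positive per anchor.
    log_prob = F.log_softmax(logits, dim=1)
    loss = -(positives * log_prob).sum(dim=1) / positives.sum(dim=1).clamp(min=1)
    return loss.mean()
```

Here graph_ids stands in for the paper's connected-graph construction over images; in practice it would be derived from the annotated QA-image correspondences.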

Related research

03/20/2023
MXM-CLR: A Unified Framework for Contrastive Learning of Multifold Cross-Modal Representations
Multifold observations are common for different data modalities, e.g., a...

08/24/2022
UniCon: Unidirectional Split Learning with Contrastive Loss for Visual Question Answering
Visual question answering (VQA) that leverages multi-modality data has a...

02/27/2023
Contrastive Video Question Answering via Video Graph Transformer
We propose to perform video question answering (VideoQA) in a Contrastiv...

06/16/2020
Mucko: Multi-Layer Cross-Modal Knowledge Reasoning for Fact-based Visual Question Answering
Fact-based Visual Question Answering (FVQA) requires external knowledge ...

03/27/2023
Curriculum Learning for Compositional Visual Reasoning
Visual Question Answering (VQA) is a complex task requiring large datase...

07/12/2022
Video Graph Transformer for Video Question Answering
This paper proposes a Video Graph Transformer (VGT) model for Video Question Answering...

05/09/2023
Exploiting Pseudo Image Captions for Multimodal Summarization
Cross-modal contrastive learning in vision language pretraining (VLP) fa...
