Exploiting Pseudo Image Captions for Multimodal Summarization

05/09/2023
by Chaoya Jiang, et al.

Cross-modal contrastive learning in vision-language pretraining (VLP) faces the challenge of (partial) false negatives. In this paper, we study this problem from the perspective of Mutual Information (MI) optimization. It is well understood that the InfoNCE loss used in contrastive learning maximizes a lower bound on the MI between anchors and their positives; we prove theoretically that the MI involving negatives also matters when noise is present. Guided by a more general lower-bound form for optimization, we propose a contrastive learning strategy regulated by progressively refined cross-modal similarity, which more accurately optimizes the MI between an image/text anchor and its negative texts/images instead of improperly minimizing it. Our method performs competitively on four downstream cross-modal tasks and systematically balances the beneficial and harmful effects of (partial) false negative samples under theoretical guidance.
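To make the idea concrete, the sketch below contrasts a standard InfoNCE loss with a similarity-regulated variant in which off-diagonal pairs with high estimated cross-modal similarity are treated as partial positives (soft targets) rather than hard negatives. This is an illustrative NumPy sketch of the general technique, not the paper's exact formulation; the function names and the soft-target scheme are assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere (cosine similarity space)."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def info_nce(img, txt, tau=0.07):
    """Standard InfoNCE over a batch: diagonal image-text pairs are
    positives, every off-diagonal pair is a hard negative."""
    img, txt = l2_normalize(img), l2_normalize(txt)
    logits = img @ txt.T / tau  # (N, N) scaled cosine similarities
    # log-softmax over each row, then cross-entropy with diagonal targets
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def regulated_info_nce(img, txt, sim, tau=0.07):
    """Hypothetical similarity-regulated variant: `sim` (values in [0, 1])
    estimates cross-modal similarity for off-diagonal pairs; pairs with
    high `sim` (likely partial false negatives) receive soft-target mass
    instead of being pushed away as hard negatives."""
    img, txt = l2_normalize(img), l2_normalize(txt)
    logits = img @ txt.T / tau
    targets = sim.copy()
    np.fill_diagonal(targets, 1.0)               # true positives keep weight 1
    targets /= targets.sum(axis=1, keepdims=True)  # normalize to a distribution
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean((targets * log_prob).sum(axis=1))
```

When `sim` is all zeros the regulated loss reduces exactly to standard InfoNCE; as `sim` entries grow toward 1, the corresponding pairs stop being minimized as negatives, which is the behavior the abstract describes for (partial) false negatives.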


