Crisscrossed Captions: Extended Intramodal and Intermodal Semantic Similarity Judgments for MS-COCO

04/30/2020
by Zarana Parekh, et al.

Image captioning datasets have proven useful for multimodal representation learning, and a common evaluation paradigm based on multimodal retrieval has emerged. Unfortunately, such datasets have only limited cross-modal associations: images are not paired with other images, captions are paired only with the other captions of the same image, there are no negative associations, and many positive cross-modal associations are missing. This undermines retrieval evaluation and limits research into how inter-modality learning impacts intra-modality tasks. To address this gap, we create the Crisscrossed Captions (CxC) dataset, extending MS-COCO with new semantic similarity judgments for 247,315 intra- and inter-modality pairs. We provide baseline model results for both retrieval and correlation with human rankings, emphasizing both intra- and inter-modality learning.
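As a rough illustration of the two evaluation modes the abstract mentions, here is a minimal Python sketch: Recall@k for cross-modal retrieval, and Spearman correlation of model similarity scores against graded human judgments. The function names, the toy data, and the assumption that matched pairs sit on the diagonal of the similarity matrix are illustrative, not taken from the CxC codebase.

```python
import numpy as np
from scipy.stats import spearmanr

def recall_at_k(sim, k=1):
    # sim[i, j]: similarity of query i to candidate j; the true match
    # for query i is assumed (for this sketch) to sit at sim[i, i].
    ranks = (sim >= sim.diagonal()[:, None]).sum(axis=1)  # rank 1 = best
    return float(np.mean(ranks <= k))

def human_correlation(model_scores, human_ratings):
    # Spearman's rho between model similarity scores and graded human
    # judgments over the same list of pairs.
    return spearmanr(model_scores, human_ratings).correlation

# Toy usage with random unit embeddings standing in for a trained model.
rng = np.random.default_rng(0)
img = rng.normal(size=(100, 64))
txt = rng.normal(size=(100, 64))
img /= np.linalg.norm(img, axis=1, keepdims=True)
txt /= np.linalg.norm(txt, axis=1, keepdims=True)
sim = img @ txt.T

print("image->text R@5:", recall_at_k(sim, k=5))
print("rho vs. toy ratings:", human_correlation(sim.ravel(), rng.random(sim.size)))
```

The same two measurements extend to the intra-modality pairs CxC adds: image-image and caption-caption similarity matrices can be scored by the identical Recall@k and correlation routines.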


Related research

07/19/2018 · Revisiting Cross Modal Retrieval
This paper proposes a cross-modal retrieval system that leverages on ima...

05/28/2021 · Learning Relation Alignment for Calibrated Cross-modal Retrieval
Despite the achievements of large-scale multimodal pre-training approach...

04/13/2023 · Noisy Correspondence Learning with Meta Similarity Correction
Despite the success of multimodal learning in cross-modal retrieval task...

04/07/2022 · ECCV Caption: Correcting False Negatives by Collecting Machine-and-Human-verified Image-Caption Associations for MS-COCO
Image-Text matching (ITM) is a common task for evaluating the quality of...

05/24/2022 · VLCDoC: Vision-Language Contrastive Pre-Training Model for Cross-Modal Document Classification
Multimodal learning from document data has achieved great success lately...

05/11/2023 · Continual Vision-Language Representation Learning with Off-Diagonal Information
This paper discusses the feasibility of continuously training the CLIP m...

07/18/2017 · VSE++: Improving Visual-Semantic Embeddings with Hard Negatives
We present a new technique for learning visual-semantic embeddings for c... (see the loss sketch after this list)
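The VSE++ entry above names its key technique in its title: hard negatives. Below is a minimal PyTorch sketch of a max-of-hinges triplet loss with in-batch hardest negatives in that spirit; the batch layout, margin value, and function name are assumptions for illustration, not code from the paper.

```python
import torch

def vse_hard_negative_loss(im, cap, margin=0.2):
    # im, cap: (B, D) L2-normalized image / caption embeddings;
    # row i of each is a matched pair, other rows act as negatives.
    scores = im @ cap.t()                # (B, B) cosine similarities
    pos = scores.diag().view(-1, 1)      # matched-pair scores

    # Hinge costs against every in-batch negative.
    cost_cap = (margin + scores - pos).clamp(min=0)      # wrong caption per image
    cost_im = (margin + scores - pos.t()).clamp(min=0)   # wrong image per caption

    eye = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_cap = cost_cap.masked_fill(eye, 0)
    cost_im = cost_im.masked_fill(eye, 0)

    # Keep only the hardest in-batch negative in each direction
    # (the "++" idea), rather than summing over all negatives.
    return cost_cap.max(dim=1).values.mean() + cost_im.max(dim=0).values.mean()

# Toy usage.
im = torch.nn.functional.normalize(torch.randn(32, 128), dim=1)
cap = torch.nn.functional.normalize(torch.randn(32, 128), dim=1)
print(vse_hard_negative_loss(im, cap))
```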
