VSE++: Improving Visual-Semantic Embeddings with Hard Negatives

by Fartash Faghri et al.

We present a new technique for learning visual-semantic embeddings for cross-modal retrieval. Inspired by the use of hard negatives in structured prediction, and by ranking loss functions used in retrieval, we introduce a simple change to common loss functions used to learn multi-modal embeddings. That change, combined with fine-tuning and the use of augmented data, yields significant gains in retrieval performance. We showcase our approach, dubbed VSE++, on the MS-COCO and Flickr30K datasets, using ablation studies and comparisons with existing methods. On MS-COCO our approach outperforms state-of-the-art methods by 8.8%.
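The "simple change" the abstract refers to is replacing the usual sum over all negatives in the triplet ranking loss with the single hardest negative in the batch. Below is a minimal NumPy sketch of that max-of-hinges idea; the function name `vsepp_mh_loss` and the batch-diagonal convention (matched image-caption pairs on the diagonal of the similarity matrix) are illustrative assumptions, not the authors' code.

```python
import numpy as np

def vsepp_mh_loss(sim, margin=0.2):
    """Max-of-hinges (MH) triplet loss sketch with in-batch hard negatives.

    sim: (n, n) similarity matrix where sim[i, j] is the similarity of
    image i and caption j; diagonal entries are the positive pairs.
    """
    pos = np.diag(sim)  # s(i, c) for the matched pairs
    # Hinge cost of every caption as a negative for each image, and of
    # every image as a negative for each caption.
    cost_cap = np.maximum(0.0, margin + sim - pos[:, None])  # image -> captions
    cost_img = np.maximum(0.0, margin + sim - pos[None, :])  # caption -> images
    # Positive pairs must not contribute to the hinge terms.
    np.fill_diagonal(cost_cap, 0.0)
    np.fill_diagonal(cost_img, 0.0)
    # VSE++ change: keep only the hardest negative (max) per query,
    # instead of summing the hinge over all negatives.
    return (cost_cap.max(axis=1) + cost_img.max(axis=0)).mean()
```

With a perfectly separated batch (diagonal similarities well above the off-diagonal ones by at least the margin) the loss is zero; a single confusing caption dominates the loss for its image, which is the intended effect of mining hard negatives.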


Related work:

- Contrastive Learning of Visual-Semantic Embeddings — Contrastive learning is a powerful technique to learn representations th...
- Fine-grained Visual Textual Alignment for Cross-Modal Retrieval using Transformer Encoders — Despite the evolution of deep-learning-based visual-textual processing s...
- Cross-modal Center Loss — Cross-modal retrieval aims to learn discriminative and modal-invariant f...
- Multi-Modal Retrieval using Graph Neural Networks — Most real world applications of image retrieval such as Adobe Stock, whi...
- Improving Cross-Modal Retrieval with Set of Diverse Embeddings — Cross-modal retrieval across image and text modalities is a challenging ...
- Crisscrossed Captions: Extended Intramodal and Intermodal Semantic Similarity Judgments for MS-COCO — Image captioning datasets have proven useful for multimodal representati...
- UniUD Submission to the EPIC-Kitchens-100 Multi-Instance Retrieval Challenge 2023 — In this report, we present the technical details of our submission to th...

Code Repositories


PyTorch Code for the paper "VSE++: Improving Visual-Semantic Embeddings with Hard Negatives"