VSE++: Improving Visual-Semantic Embeddings with Hard Negatives

07/18/2017
by Fartash Faghri, et al.

We present a new technique for learning visual-semantic embeddings for cross-modal retrieval. Inspired by the use of hard negatives in structured prediction, and by ranking loss functions used in retrieval, we introduce a simple change to common loss functions used to learn multi-modal embeddings. That change, combined with fine-tuning and the use of augmented data, yields significant gains in retrieval performance. We showcase our approach, dubbed VSE++, on the MS-COCO and Flickr30K datasets, using ablation studies and comparisons with existing methods. On MS-COCO our approach outperforms state-of-the-art methods by 8.8% in caption retrieval and 11.3% in image retrieval (at R@1).
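The change to the loss is that, instead of summing hinge violations over all negatives for a positive image-caption pair, only the hardest (highest-scoring) in-batch negative contributes. Below is a minimal PyTorch sketch of such a max-of-hinges loss; the function name, the margin value, and the use of L2-normalized embeddings scored by dot product (cosine similarity) are illustrative assumptions, not the authors' released implementation.

import torch

def vse_hard_negative_loss(img_emb, cap_emb, margin=0.2):
    # img_emb, cap_emb: (batch, dim) L2-normalized embeddings; row i of each
    # forms an aligned image-caption pair. margin is an assumed hyperparameter.
    scores = img_emb @ cap_emb.t()            # (batch, batch) similarity matrix
    pos = scores.diag().view(-1, 1)           # similarity of each true pair

    # hinge cost against every in-batch negative, per direction
    cost_cap = (margin + scores - pos).clamp(min=0)      # image -> wrong captions
    cost_img = (margin + scores - pos.t()).clamp(min=0)  # caption -> wrong images

    # mask the diagonal so the positive pair is not treated as a negative
    eye = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_cap = cost_cap.masked_fill(eye, 0)
    cost_img = cost_img.masked_fill(eye, 0)

    # keep only the hardest negative per image and per caption (the VSE++ idea),
    # rather than summing over all negatives as in the standard ranking loss
    return cost_cap.max(dim=1)[0].sum() + cost_img.max(dim=0)[0].sum()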
