Semantically Enhanced Hard Negatives for Cross-modal Information Retrieval

10/10/2022
by Yan Gong, et al.

Visual Semantic Embedding (VSE) aims to extract the semantics of images and their descriptions and embed them into the same latent space for cross-modal information retrieval. Most existing VSE networks are trained with a hard negatives loss function that learns an objective margin between the similarities of relevant and irrelevant image-description embedding pairs. However, this objective margin is set as a fixed hyperparameter, which ignores the semantic differences among the irrelevant image-description pairs. To address the challenge of measuring the optimal similarities between image-description pairs before the VSE networks are trained, this paper presents a novel approach comprising two main parts: (1) a method for finding the underlying semantics of image descriptions; and (2) a novel semantically enhanced hard negatives loss function, in which the learning objective is dynamically determined based on the optimal similarity scores between irrelevant image-description pairs. Extensive experiments were carried out by integrating the proposed methods into five state-of-the-art VSE networks applied to three benchmark datasets for cross-modal information retrieval tasks. The results show that the proposed methods achieved the best performance and can also be adopted by existing and future VSE networks.
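The contrast between a fixed-margin hard negatives loss and a dynamic, semantics-aware margin can be sketched as follows. This is a minimal NumPy illustration, not the paper's exact formulation: the function names, the linear margin-scaling rule, and the caption-caption `sem_sim` matrix are assumptions made for the example.

```python
import numpy as np

def cosine_sim(a, b):
    # Pairwise cosine similarity between rows of a and rows of b.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def hard_negative_loss(img_emb, txt_emb, margin=0.2):
    """VSE++-style hinge loss: fixed margin, hardest in-batch negatives."""
    s = cosine_sim(img_emb, txt_emb)            # (B, B) similarity matrix
    pos = np.diag(s)                            # matched pairs on the diagonal
    mask = np.eye(len(s), dtype=bool)
    neg_img = np.where(mask, -np.inf, s).max(axis=1)  # hardest caption per image
    neg_txt = np.where(mask, -np.inf, s).max(axis=0)  # hardest image per caption
    loss_i = np.maximum(0.0, margin + neg_img - pos)
    loss_t = np.maximum(0.0, margin + neg_txt - pos)
    return (loss_i + loss_t).mean()

def semantically_enhanced_loss(img_emb, txt_emb, sem_sim, base_margin=0.2):
    """Illustrative dynamic-margin variant: shrink the margin for negatives
    whose descriptions are semantically close to the positive description.
    `sem_sim` is a (B, B) matrix of caption-caption similarities in [0, 1]
    (an assumed proxy for the optimal similarity between irrelevant pairs)."""
    s = cosine_sim(img_emb, txt_emb)
    pos = np.diag(s)
    mask = np.eye(len(s), dtype=bool)
    B = len(s)
    idx_i = np.where(mask, -np.inf, s).argmax(axis=1)  # hardest caption index
    idx_t = np.where(mask, -np.inf, s).argmax(axis=0)  # hardest image index
    # Dynamic margin: semantically similar negatives get a smaller target gap.
    m_i = base_margin * (1.0 - sem_sim[np.arange(B), idx_i])
    m_t = base_margin * (1.0 - sem_sim[idx_t, np.arange(B)])
    loss_i = np.maximum(0.0, m_i + s[np.arange(B), idx_i] - pos)
    loss_t = np.maximum(0.0, m_t + s[idx_t, np.arange(B)] - pos)
    return (loss_i + loss_t).mean()
```

With `sem_sim` set to all zeros, the dynamic margin reduces to the fixed `base_margin`, so the variant degenerates to the standard hard negatives loss; nonzero semantic similarity relaxes the required gap for near-duplicate descriptions instead of penalizing them as if they were fully irrelevant.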


Related research

02/13/2023 · CLIP-RR: Improved CLIP Network for Relation-Focused Cross-Modal Information Retrieval
Relation-focused cross-modal information retrieval focuses on retrieving...

02/03/2018 · Modeling Text with Graph Convolutional Network for Cross-Modal Information Retrieval
Cross-modal information retrieval aims to find heterogeneous data of var...

08/09/2021 · Learning Joint Embedding with Modality Alignments for Cross-Modal Retrieval of Recipes and Food Images
This paper presents a three-tier modality alignment approach to learning...

09/03/2019 · Do Cross Modal Systems Leverage Semantic Relationships?
Current cross-modal retrieval systems are evaluated using R@K measure wh...

08/02/2021 · Efficient Deep Feature Calibration for Cross-Modal Joint Embedding Learning
This paper introduces a two-phase deep feature calibration framework for...

10/22/2021 · Learning Text-Image Joint Embedding for Efficient Cross-Modal Retrieval with Deep Feature Engineering
This paper introduces a two-phase deep feature engineering framework for...

08/02/2021 · Learning TFIDF Enhanced Joint Embedding for Recipe-Image Cross-Modal Retrieval Service
It is widely acknowledged that learning joint embeddings of recipes with...
