CLIP-RR: Improved CLIP Network for Relation-Focused Cross-Modal Information Retrieval

02/13/2023
by Yan Gong, et al.

Relation-focused cross-modal information retrieval aims to retrieve information based on the relations expressed in user queries, and it is particularly important in information retrieval applications and next-generation search engines. To date, CLIP (Contrastive Language-Image Pre-training) has achieved state-of-the-art performance in cross-modal learning tasks due to its efficient learning of visual concepts from natural language supervision. However, CLIP learns visual representations from natural language at a global level, without the capability of focusing on image-object relations. This paper proposes a novel CLIP-based network for Relation Reasoning, CLIP-RR, that tackles relation-focused cross-modal information retrieval. The proposed network utilises CLIP to leverage its pre-trained knowledge, and it additionally comprises two main parts: (1) a component that extends the capabilities of CLIP to extract and reason with object relations in images; and (2) a component that aggregates the reasoned results to predict similarity scores between images and descriptions. Experiments were carried out by applying the proposed network to relation-focused cross-modal information retrieval tasks on the RefCOCOg, CLEVR, and Flickr30K datasets. The results revealed that the proposed network outperformed various other state-of-the-art networks, including CLIP, VSE∞, and VSRN++, on both image-to-text and text-to-image cross-modal information retrieval tasks.
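
As a rough illustration of the two parts described above, the following is a minimal PyTorch-style sketch. It is not the paper's exact design: the attention-based relation step, the mean-pool aggregation, the cosine-similarity scoring, and all module names and dimensions are illustrative assumptions; region features and text embeddings are assumed to come from CLIP's pre-trained encoders.

```python
import torch
import torch.nn as nn

class RelationReasoning(nn.Module):
    """Hypothetical relation-reasoning step over object-region features
    (sketch only; the paper's actual mechanism may differ)."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

    def forward(self, regions: torch.Tensor) -> torch.Tensor:
        # regions: (batch, n_regions, dim) object features, assumed to be
        # extracted with CLIP's pre-trained visual backbone.
        attn = torch.softmax(
            self.query(regions) @ self.key(regions).transpose(1, 2)
            / regions.size(-1) ** 0.5,
            dim=-1,
        )
        # Each region is updated using its relations to all other regions.
        return regions + attn @ self.value(regions)

class CLIPRRScorer(nn.Module):
    """Aggregates the reasoned region features and scores them against
    text embeddings (illustrative aggregation and scoring choices)."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.reason = RelationReasoning(dim)

    def forward(self, regions: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # text: (batch, dim) sentence embeddings, assumed from CLIP's
        # text encoder.
        reasoned = self.reason(regions)       # (batch, n_regions, dim)
        image = reasoned.mean(dim=1)          # simple mean-pool aggregation
        image = image / image.norm(dim=-1, keepdim=True)
        text = text / text.norm(dim=-1, keepdim=True)
        return image @ text.t()               # (batch, batch) similarity scores
```

In a retrieval setting, the resulting similarity matrix would be ranked per row for image-to-text retrieval and per column for text-to-image retrieval.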


Related research

07/26/2023
Neural-based Cross-modal Search and Retrieval of Artwork
Creating an intelligent search and retrieval system for artwork images, ...

10/10/2022
Semantically Enhanced Hard Negatives for Cross-modal Information Retrieval
Visual Semantic Embedding (VSE) aims to extract the semantics of images ...

08/11/2020
KBGN: Knowledge-Bridge Graph Network for Adaptive Vision-Text Reasoning in Visual Dialogue
Visual dialogue is a challenging task that needs to extract implicit inf...

04/20/2023
Is Cross-modal Information Retrieval Possible without Training?
Encoded representations from a pretrained deep learning model (e.g., BER...

07/26/2023
Boon: A Neural Search Engine for Cross-Modal Information Retrieval
Visual-Semantic Embedding (VSE) networks can help search engines better ...

12/23/2018
Multi-modal Learning with Prior Visual Relation Reasoning
Visual relation reasoning is a central component in recent cross-modal a...

05/20/2021
More Than Just Attention: Learning Cross-Modal Attentions with Contrastive Constraints
Attention mechanisms have been widely applied to cross-modal tasks such ...
