Consensus-Aware Visual-Semantic Embedding for Image-Text Matching

07/17/2020
by   Haoran Wang, et al.
3

Image-text matching plays a central role in bridging vision and language. Most existing approaches only rely on the image-text instance pair to learn their representations, thereby exploiting their matching relationships and making the corresponding alignments. Such approaches only exploit the superficial associations contained in the instance pairwise data, with no consideration of any external commonsense knowledge, which may hinder their capabilities to reason the higher-level relationships between image and text. In this paper, we propose a Consensus-aware Visual-Semantic Embedding (CVSE) model to incorporate the consensus information, namely the commonsense knowledge shared between both modalities, into image-text matching. Specifically, the consensus information is exploited by computing the statistical co-occurrence correlations between the semantic concepts from the image captioning corpus and deploying the constructed concept correlation graph to yield the consensus-aware concept (CAC) representations. Afterwards, CVSE learns the associations and alignments between image and text based on the exploited consensus as well as the instance-level representations for both modalities. Extensive experiments conducted on two public datasets verify that the exploited consensus makes significant contributions to constructing more meaningful visual-semantic embeddings, with the superior performances over the state-of-the-art approaches on the bidirectional image and text retrieval task. Our code of this paper is available at: https://github.com/BruceW91/CVSE.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
04/04/2022

Learning Commonsense-aware Moment-Text Alignment for Fast Video Temporal Grounding

Grounding temporal video segments described in natural language queries ...
research
11/17/2022

Visual Commonsense-aware Representation Network for Video Captioning

Generating consecutive descriptions for videos, i.e., Video Captioning, ...
research
09/06/2019

Visual Semantic Reasoning for Image-Text Matching

Image-text matching has been a hot research topic bridging the vision an...
research
08/06/2021

Interpretable Visual Understanding with Cognitive Attention Network

While image understanding on recognition-level has achieved remarkable a...
research
08/23/2023

CgT-GAN: CLIP-guided Text GAN for Image Captioning

The large-scale visual-language pre-trained model, Contrastive Language-...
research
11/15/2017

Dual-Path Convolutional Image-Text Embedding

This paper considers the task of matching images and sentences. The chal...
research
03/01/2023

The style transformer with common knowledge optimization for image-text retrieval

Image-text retrieval which associates different modalities has drawn bro...

Please sign up or login with your details

Forgot password? Click here to reset