Uncertainty-Aware Multi-View Visual Semantic Embedding

09/15/2023
by Wenzhang Wei, et al.
The key challenge in image-text retrieval is effectively leveraging semantic information to measure the similarity between vision and language data. However, using instance-level binary labels, where each image is paired with a single text, fails to capture the multiple correspondences between different semantic units, leading to uncertainty in multi-modal semantic understanding. Although recent research has captured fine-grained information through more complex model structures or pre-training techniques, few studies have directly modeled the uncertainty of correspondence to fully exploit binary labels. To address this issue, we propose an Uncertainty-Aware Multi-View Visual Semantic Embedding (UAMVSE) framework that decomposes the overall image-text matching into multiple view-text matchings. Our framework introduces an uncertainty-aware loss function (UALoss) that computes the weight of each view-text loss by adaptively modeling the uncertainty in each view-text correspondence. The different weights guide the model to focus on different semantic information, enhancing its ability to comprehend the correspondence between images and texts. We also design an optimized image-text matching strategy that normalizes the similarity matrix to improve model performance. Experimental results on the Flickr30k and MS-COCO datasets demonstrate that UAMVSE outperforms state-of-the-art models.
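
The abstract does not spell out the loss formulation, so the following is only a minimal PyTorch sketch of how an uncertainty-weighted multi-view matching loss could look: a hinge-based triplet loss per view, a learnable log-variance per view as the uncertainty signal, and a row-normalized similarity matrix. The class and parameter names (num_views, margin, log_sigma) are illustrative assumptions, not the paper's actual UALoss implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class UncertaintyWeightedMultiViewLoss(nn.Module):
    """Hypothetical sketch: weight per-view triplet losses by a learned uncertainty."""

    def __init__(self, num_views: int, margin: float = 0.2):
        super().__init__()
        self.margin = margin
        # One learnable log-variance per view; a noisier view gets a smaller weight.
        self.log_sigma = nn.Parameter(torch.zeros(num_views))

    def _triplet_loss(self, sim: torch.Tensor) -> torch.Tensor:
        # sim: (B, B) similarity matrix between B view embeddings and B text embeddings.
        pos = sim.diag().unsqueeze(1)  # similarity of the matched image-text pairs
        mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
        cost_txt = (self.margin + sim - pos).clamp(min=0).masked_fill(mask, 0)
        cost_img = (self.margin + sim - pos.t()).clamp(min=0).masked_fill(mask, 0)
        # Hardest-negative mining in both retrieval directions.
        return cost_txt.max(dim=1)[0].mean() + cost_img.max(dim=0)[0].mean()

    def forward(self, view_embs, txt_emb):
        # view_embs: list of (B, D) image-view embeddings; txt_emb: (B, D) text embeddings.
        txt_emb = F.normalize(txt_emb, dim=-1)
        total = torch.zeros((), device=txt_emb.device)
        for v, img_emb in enumerate(view_embs):
            sim = F.normalize(img_emb, dim=-1) @ txt_emb.t()
            # Row-normalize the similarity matrix (a stand-in for the paper's
            # optimized matching strategy, which is not detailed in the abstract).
            sim = sim / sim.norm(dim=1, keepdim=True).clamp(min=1e-8)
            # Homoscedastic-style weighting: exp(-log_sigma) down-weights uncertain
            # views, and the +log_sigma term keeps the uncertainty from growing freely.
            total = total + torch.exp(-self.log_sigma[v]) * self._triplet_loss(sim) \
                    + self.log_sigma[v]
        return total
```

In this reading, views whose correspondence to the text is unreliable contribute less to the total loss, which is one common way to realize adaptive per-view weighting; the paper's UALoss may differ in its exact form.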

Related research

- 06/17/2019 · ParNet: Position-aware Aggregated Relation Network for Image-Text matching
  Exploring fine-grained relationship between entities (e.g. objects in ima...
- 09/12/2021 · Constructing Phrase-level Semantic Labels to Form Multi-Grained Supervision for Image-Text Retrieval
  Existing research for image text retrieval mainly relies on sentence-lev...
- 09/21/2020 · Multi-Modal Reasoning Graph for Scene-Text Based Fine-Grained Image Classification and Retrieval
  Scene text instances found in natural images carry explicit semantic inf...
- 03/08/2020 · IMRAM: Iterative Matching with Recurrent Attention Memory for Cross-Modal Image-Text Retrieval
  Enabling bi-directional retrieval of images and texts is important for u...
- 05/31/2023 · Boosting Text-to-Image Diffusion Models with Fine-Grained Semantic Rewards
  Recent advances in text-to-image diffusion models have achieved remarkab...
- 05/09/2021 · Beyond Monocular Deraining: Parallel Stereo Deraining Network Via Semantic Prior
  Rain is a common natural phenomenon. Taking images in the rain however o...
- 08/16/2023 · Ranking-aware Uncertainty for Text-guided Image Retrieval
  Text-guided image retrieval is to incorporate conditional text to better...