Relieving Triplet Ambiguity: Consensus Network for Language-Guided Image Retrieval

06/03/2023
by   Xu Zhang, et al.

Language-guided image retrieval lets users search for images and interact with the retrieval system more naturally and expressively by using a reference image and a relative caption as the query. Most existing studies focus on designing image-text composition architectures that extract discriminative visual-linguistic relations. Despite their success, we identify an inherent problem that obstructs the extraction of discriminative features and considerably compromises model training: triplet ambiguity. The problem stems from the annotation process, in which annotators view only one triplet at a time. As a result, they often describe simple attributes, such as color, while neglecting fine-grained details like location and style, so multiple false-negative candidates match the same modification text. We propose a novel Consensus Network (Css-Net) that self-adaptively learns from noisy triplets to minimize the negative effects of triplet ambiguity. Inspired by the psychological finding that groups perform better than individuals, Css-Net comprises 1) a consensus module featuring four distinct compositors that generate diverse fused image-text embeddings and 2) a Kullback-Leibler divergence loss, which fosters learning among the compositors, enabling them to reduce biases learned from noisy triplets and reach a consensus. During evaluation, the decisions of the four compositors are weighted to further achieve consensus. Comprehensive experiments on three datasets demonstrate that Css-Net can alleviate triplet ambiguity, achieving competitive performance on benchmarks, such as +2.77% R@10 and +6.67% R@50 on FashionIQ.
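
The consensus idea described above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the abstract does not give the exact form of the Kullback-Leibler loss or the evaluation weights, so the symmetric pairwise averaging, the function names (`consensus_kl_loss`, `weighted_scores`), and the example weights are all assumptions made for illustration. Each compositor is assumed to output a matrix of similarity logits between its fused query embedding and every candidate image.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax over the candidate axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    # Mean KL(p || q) over the batch; rows of p and q are distributions.
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))

def consensus_kl_loss(logits_per_compositor):
    # Assumed formulation: average KL over every ordered pair of compositors,
    # so each compositor is pulled toward the predictions of the others.
    probs = [softmax(l) for l in logits_per_compositor]
    total, pairs = 0.0, 0
    for i, p in enumerate(probs):
        for j, q in enumerate(probs):
            if i != j:
                total += kl_divergence(p, q)
                pairs += 1
    return total / pairs

def weighted_scores(logits_per_compositor, weights):
    # Evaluation-time fusion: weighted average of the compositors' scores.
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    stacked = np.stack(logits_per_compositor)      # (compositors, batch, candidates)
    return np.tensordot(weights, stacked, axes=1)  # (batch, candidates)

# Toy data: 4 compositors, 2 queries, 5 candidate images (random placeholders).
rng = np.random.default_rng(0)
logits = [rng.normal(size=(2, 5)) for _ in range(4)]

loss = consensus_kl_loss(logits)                          # added to the training loss
scores = weighted_scores(logits, [1.0, 1.0, 0.5, 0.5])    # hypothetical weights
ranking = np.argsort(-scores, axis=1)                     # best candidate first
```

In this sketch the loss is zero only when all compositors produce the same distribution over candidates, which is one simple way to read "reaching a consensus"; the weights would in practice be chosen or tuned per dataset.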


