Image Retrieval on Real-life Images with Pre-trained Vision-and-Language Models

08/09/2021
by Zheyuan Liu, et al.
We extend the task of composed image retrieval, where an input query consists of an image and a short textual description of how to modify the image. Existing methods have only been applied to non-complex images within narrow domains, such as fashion products, which limits the study of in-depth visual reasoning in rich image and language contexts. To address this issue, we collect the Compose Image Retrieval on Real-life images (CIRR) dataset, which consists of over 36,000 pairs of crowd-sourced, open-domain images with human-generated modifying text. To extend current methods to the open domain, we propose CIRPLANT, a transformer-based model that leverages rich pre-trained vision-and-language (V&L) knowledge to modify visual features conditioned on natural language. Retrieval is then done by nearest-neighbor lookup on the modified features. We demonstrate that, with a relatively simple architecture, CIRPLANT outperforms existing methods on open-domain images while matching state-of-the-art accuracy on existing narrow-domain datasets, such as fashion. Together with the release of CIRR, we believe this work will inspire further research on composed image retrieval.
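The retrieval step described above, a nearest-neighbor lookup on the modified features, can be illustrated with a minimal sketch. This is not the CIRPLANT implementation: the function names `cosine_similarity` and `retrieve_nearest`, the use of cosine similarity as the distance measure, the 16-dimensional features, and the random stand-in data are all assumptions made for illustration. The modified query feature is simulated here by slightly perturbing one candidate, in place of the output of a trained model.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_nearest(query_feature, candidate_features, top_k=5):
    """Return (index, score) pairs for the top_k most similar candidates.

    Hypothetical helper: ranks every candidate by cosine similarity to the
    query feature, highest first (an assumed metric, not the paper's).
    """
    scored = [
        (cosine_similarity(query_feature, candidate), index)
        for index, candidate in enumerate(candidate_features)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(index, score) for score, index in scored[:top_k]]


# Stand-in data: 100 random candidate image features (assumed dimension 16).
random.seed(0)
candidates = [[random.gauss(0, 1) for _ in range(16)] for _ in range(100)]

# Stand-in for a modified query feature: candidate 42 plus small noise.
modified = [x + 0.01 * random.gauss(0, 1) for x in candidates[42]]

results = retrieve_nearest(modified, candidates, top_k=3)
```

In this toy setup the first entry of `results` should point at candidate 42, since the query is a near copy of it. In a real system the candidate features would come from an image encoder and the query feature from the model that combines the reference image with the modifying text.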
