Telling the What while Pointing the Where: Fine-grained Mouse Trace and Language Supervision for Improved Image Retrieval

02/09/2021
by   Soravit Changpinyo, et al.

Existing image retrieval systems use text queries to provide a natural and practical way for users to express what they are looking for. However, fine-grained image retrieval often also requires the ability to express where in the image that content should appear. Text can express such localization preferences only cumbersomely, whereas pointing is a natural fit. In this paper, we describe an image retrieval setup where the user simultaneously describes an image using both spoken natural language (the "what") and mouse traces over an empty canvas (the "where") to express the characteristics of the desired target image. To this end, we use the Localized Narratives dataset to learn an image retrieval model that is capable of performing early fusion between text descriptions and synchronized mouse traces. Qualitative and quantitative experiments show that our model is capable of taking this spatial guidance into account and provides more accurate retrieval results than equivalent text-only systems.
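
The abstract does not spell out how the early fusion is implemented, so the following is only a minimal sketch of the general idea, not the paper's model. It assumes each word comes with the mouse-trace points recorded while it was spoken, summarizes those points as a normalized bounding box, and concatenates the box with the word vector before any encoder sees the sequence. The names `trace_to_box` and `early_fuse`, the bounding-box summary, and the toy two-dimensional word vectors are all hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]              # (x, y), normalized to [0, 1]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def trace_to_box(points: Sequence[Point]) -> Box:
    """Summarize the trace segment synchronized with one word as a box.

    Assumption: a bounding box is an adequate summary of the segment.
    """
    if not points:
        raise ValueError("each word needs at least one trace point")
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


def early_fuse(
    word_vectors: Sequence[Sequence[float]],
    trace_segments: Sequence[Sequence[Point]],
) -> List[List[float]]:
    """Concatenate each word vector with its trace box, token by token.

    The fused sequence is what a downstream query encoder would consume,
    so "what" and "where" are combined before encoding (early fusion).
    """
    if len(word_vectors) != len(trace_segments):
        raise ValueError("need exactly one trace segment per word")
    return [
        list(vec) + list(trace_to_box(seg))
        for vec, seg in zip(word_vectors, trace_segments)
    ]


# Toy query with two words; the vectors are placeholders, not real embeddings.
word_vectors = [[1.0, 0.0], [0.0, 1.0]]
trace_segments = [
    [(0.1, 0.2), (0.3, 0.1)],  # trace drawn while the first word was spoken
    [(0.7, 0.8), (0.9, 0.6)],  # trace drawn while the second word was spoken
]
fused = early_fuse(word_vectors, trace_segments)
print(fused)
```

In contrast, a late-fusion variant would encode the text and the trace separately and combine only the two resulting query vectors; the per-token concatenation here is what lets each word carry its own location.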


research
01/14/2020

Fine-grained Image Classification and Retrieval by Combining Visual and Locally Pooled Textual Features

Text contained in an image carries high-level semantics that can be expl...
research
06/19/2018

FineTag: Multi-label Retrieval of Attributes at Fine-grained Level in Images

In image retrieval, the features extracted from an item are used to look...
research
11/10/2019

Drill-down: Interactive Retrieval of Complex Scenes using Natural Language Queries

This paper explores the task of interactive image retrieval using natura...
research
05/23/2023

Mitigating Test-Time Bias for Fair Image Retrieval

We address the challenge of generating fair and unbiased image retrieval...
research
06/12/2023

Sticker820K: Empowering Interactive Retrieval with Stickers

Stickers have become a ubiquitous part of modern-day communication, conv...
research
06/30/2020

Modality-Agnostic Attention Fusion for visual search with text feedback

Image retrieval with natural language feedback offers the promise of cat...
research
11/07/2020

Text-to-Image Generation Grounded by Fine-Grained User Attention

Localized Narratives is a dataset with detailed natural language descrip...
