Modality-Agnostic Attention Fusion for visual search with text feedback

06/30/2020
by Eric Dodds, et al.
Image retrieval with natural language feedback offers the promise of catalog search based on fine-grained visual features that go beyond objects and binary attributes, facilitating real-world applications such as e-commerce. Our Modality-Agnostic Attention Fusion (MAAF) model combines image and text features and outperforms existing approaches on two datasets for visual search with modifying phrases, Fashion IQ and CSS, and performs competitively on a dataset with only single-word modifications, Fashion200k. We also introduce two new challenging benchmarks adapted from Birds-to-Words and Spot-the-Diff, which provide new settings with rich language inputs, and we show that our approach, without modification, outperforms strong baselines on them. To better understand our model, we conduct detailed ablations on Fashion IQ and provide visualizations of a surprising phenomenon: words avoid "attending" to the image regions they refer to.

research
03/08/2022

Image Search with Text Feedback by Additive Attention Compositional Learning

Effective image retrieval with text feedback stands to impact a range of...
research
06/08/2021

Conversational Fashion Image Retrieval via Multiturn Natural Language Feedback

We study the task of conversational fashion image retrieval via multitur...
research
08/09/2021

Image Retrieval on Real-life Images with Pre-trained Vision-and-Language Models

We extend the task of composed image retrieval, where an input query con...
research
02/09/2021

Telling the What while Pointing the Where: Fine-grained Mouse Trace and Language Supervision for Improved Image Retrieval

Existing image retrieval systems use text queries to provide a natural a...
research
04/23/2022

Training and challenging models for text-guided fashion image retrieval

Retrieving relevant images from a catalog based on a query image togethe...
research
09/03/2020

TRACE: Transform Aggregate and Compose Visiolinguistic Representations for Image Search with Text Feedback

The ability to efficiently search for images over an indexed database is...
research
11/25/2019

Learning to Learn Words from Narrated Video

When we travel, we often encounter new scenarios we have never experienc...
