Training and challenging models for text-guided fashion image retrieval

04/23/2022
by Eric Dodds, et al.

Retrieving relevant images from a catalog based on a query image together with a modifying caption is a challenging multimodal task that can particularly benefit domains like apparel shopping, where fine details and subtle variations may be best expressed through natural language. We introduce a new evaluation dataset, Challenging Fashion Queries (CFQ), as well as a modeling approach that achieves state-of-the-art performance on the existing Fashion IQ (FIQ) dataset. CFQ complements existing benchmarks by including relative captions with both positive and negative labels for caption accuracy and for conditional image similarity, whereas prior datasets provide only positive labels that conflate the two. We demonstrate the importance of multimodal pretraining for the task and show that domain-specific weak supervision based on attribute labels can augment generic large-scale pretraining. While previous modality fusion mechanisms lose the benefits of multimodal pretraining, we introduce a residual attention fusion mechanism that improves performance. We release CFQ and our code to the research community.
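The residual attention fusion idea named in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function name, the single-head scaled dot-product form, and the tensor shapes below are all assumptions made purely for illustration of the general pattern, in which a residual connection lets the fused output stay close to the pretrained image representation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def residual_attention_fusion(img, txt):
    """Fuse image and caption embeddings with a residual cross-attention step.

    img: (n_img_tokens, d) image token embeddings
    txt: (n_txt_tokens, d) caption token embeddings
    Returns fused embeddings of shape (n_img_tokens, d).
    """
    d = img.shape[-1]
    # Scaled dot-product cross-attention: image tokens attend to caption tokens.
    attn = softmax(img @ txt.T / np.sqrt(d))  # (n_img_tokens, n_txt_tokens)
    # Residual connection: the pretrained image representation passes through
    # unchanged, with a caption-conditioned correction added on top.
    return img + attn @ txt
```

One consequence of the residual form is that when the caption contributes nothing (all-zero text embeddings here), the fusion reduces to the identity on the image features, so the benefits of multimodal pretraining are preserved rather than overwritten.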


Related research

- SimVLM: Simple Visual Language Model Pretraining with Weak Supervision (08/24/2021)
- The Fashion IQ Dataset: Retrieving Images by Combining Side Information and Relative Natural Language Feedback (05/30/2019)
- Conversational Fashion Image Retrieval via Multiturn Natural Language Feedback (06/08/2021)
- Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks (08/22/2022)
- End-to-end Knowledge Retrieval with Multi-modal Queries (06/01/2023)
- Modality-Agnostic Attention Fusion for visual search with text feedback (06/30/2020)
- That's the Wrong Lung! Evaluating and Improving the Interpretability of Unsupervised Multimodal Encoders for Medical Data (10/12/2022)
