CoVR: Learning Composed Video Retrieval from Web Video Captions

08/28/2023
by Lucas Ventura, et al.

Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers both text and image queries together, to search for relevant images in a database. Most CoIR approaches require manually annotated datasets, comprising image-text-image triplets, where the text describes a modification from the query image to the target image. However, manual curation of CoIR triplets is expensive and prevents scalability. In this work, we instead propose a scalable automatic dataset creation methodology that generates triplets given video-caption pairs, while also expanding the scope of the task to include composed video retrieval (CoVR). To this end, we mine paired videos with a similar caption from a large database, and leverage a large language model to generate the corresponding modification text. Applying this methodology to the extensive WebVid2M collection, we automatically construct our WebVid-CoVR dataset, resulting in 1.6 million triplets. Moreover, we introduce a new benchmark for CoVR with a manually annotated evaluation set, along with baseline results. Our experiments further demonstrate that training a CoVR model on our dataset effectively transfers to CoIR, leading to improved state-of-the-art performance in the zero-shot setup on both the CIRR and FashionIQ benchmarks. Our code, datasets, and models are publicly available at https://imagine.enpc.fr/~ventural/covr.
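
The triplet-mining idea in the abstract can be illustrated with a small Python sketch. It is not the authors' pipeline. It assumes that "similar caption" means two captions that differ in exactly one word, and it replaces the large language model with a fixed text template. The names `mine_caption_pairs`, `template_modification`, and `build_triplets` are hypothetical and chosen for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def mine_caption_pairs(captions):
    """Find caption pairs that differ in exactly one word.

    `captions` maps a video id to its caption. The one-word-difference
    criterion is an assumption made for this sketch.
    Returns tuples (id_a, id_b, word_a, word_b).
    """
    buckets = defaultdict(list)
    for vid, caption in captions.items():
        words = caption.lower().split()
        for i, word in enumerate(words):
            # Key on the caption with position i masked out, so captions
            # that agree everywhere except position i share a bucket.
            key = (i, tuple(words[:i]), tuple(words[i + 1:]))
            buckets[key].append((vid, word))

    pairs = []
    for members in buckets.values():
        for (vid_a, word_a), (vid_b, word_b) in combinations(members, 2):
            if word_a != word_b:  # skip identical captions
                pairs.append((vid_a, vid_b, word_a, word_b))
    return pairs


def template_modification(word_a, word_b):
    """Stand-in for the language-model step: a fixed template, not an LLM."""
    return f"replace {word_a} with {word_b}"


def build_triplets(captions):
    """Assemble (query video, modification text, target video) triplets."""
    return [
        (vid_a, template_modification(word_a, word_b), vid_b)
        for vid_a, vid_b, word_a, word_b in mine_caption_pairs(captions)
    ]


if __name__ == "__main__":
    demo = {
        "v1": "a dog running on the beach",
        "v2": "a cat running on the beach",
        "v3": "a man cooking pasta",
    }
    for triplet in build_triplets(demo):
        print(triplet)
```

Bucketing on the masked caption avoids comparing every caption against every other one, which matters for a collection at the scale of WebVid2M. A real system would swap `template_modification` for a language-model call and would likely filter noisy pairs.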

research
09/16/2023

In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval

Large-scale noisy web image-text datasets have been proven to be efficie...
research
06/07/2019

HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

Learning text-video embeddings usually requires a dataset of video clips...
research
11/11/2021

SynthBio: A Case Study in Human-AI Collaborative Curation of Text Datasets

NLP researchers need more, higher-quality text datasets. Human-labeled d...
research
02/06/2023

Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image Retrieval

In Composed Image Retrieval (CIR), a user combines a query image with te...
research
06/12/2023

Zero-shot Composed Text-Image Retrieval

In this paper, we consider the problem of composed image retrieval (CIR)...
research
12/31/2022

Translating Text Synopses to Video Storyboards

A storyboard is a roadmap for video creation which consists of shot-by-s...
research
05/02/2018

Images & Recipes: Retrieval in the cooking context

Recent advances in the machine learning community allowed different use ...
