In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval

09/16/2023
by Nina Shvetsova, et al.

Large-scale noisy web image-text datasets have proven effective for learning robust vision-language models. However, when these models are transferred to video retrieval, they still need to be fine-tuned on hand-curated paired text-video data to adapt to the diverse styles of video descriptions. To address this problem without hand-annotated pairs, we propose a new setting, text-video retrieval with uncurated and unpaired data, in which training uses only text queries and uncurated web videos, with no paired text-video data. For this setting, we propose In-Style, an approach that learns the style of the text queries and transfers it to uncurated web videos. Moreover, to improve generalization, we show that a single model can be trained with multiple text styles by introducing a multi-style contrastive training procedure that improves generalizability across several datasets simultaneously. We evaluate retrieval performance on multiple datasets to demonstrate the advantages of our style transfer framework on the new task of uncurated unpaired text-video retrieval, and we improve state-of-the-art performance on zero-shot text-video retrieval.
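
The abstract does not spell out the training objective, so the following is only a minimal sketch of what a multi-style contrastive objective could look like, not the authors' implementation. It assumes that each text style contributes a batch of pseudo-paired text and video embeddings (row i of the text batch matches row i of the video batch), that each batch is scored with a symmetric InfoNCE-style loss, and that the per-style losses are simply averaged. The function names, the temperature value, and the averaging rule are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diag(sim, temperature):
    """Mean cross-entropy over rows, where row i's correct column is i."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum - logits[i]
    return total / len(sim)


def symmetric_contrastive_loss(text_emb, video_emb, temperature=0.05):
    """Symmetric InfoNCE-style loss over one batch of (text, video) pairs."""
    t = [l2_normalize(v) for v in text_emb]
    v = [l2_normalize(x) for x in video_emb]
    # Cosine similarity between every text and every video in the batch.
    sim = [[sum(a * b for a, b in zip(ti, vj)) for vj in v] for ti in t]
    sim_t = [list(col) for col in zip(*sim)]  # video-to-text direction
    return 0.5 * (cross_entropy_diag(sim, temperature)
                  + cross_entropy_diag(sim_t, temperature))


def multi_style_loss(batches, temperature=0.05):
    """Average the per-style losses (assumed combination rule).

    `batches` is a list of (text_emb, video_emb) tuples, one per text style.
    """
    losses = [symmetric_contrastive_loss(t, v, temperature) for t, v in batches]
    return sum(losses) / len(losses)
```

Under these assumptions, a batch whose text and video embeddings are correctly aligned yields a lower loss than the same batch with its videos shuffled, which is the signal a retrieval model would be trained on.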


research
04/24/2020

ST^2: Small-data Text Style Transfer via Multi-task Meta-Learning

Text style transfer aims to paraphrase a sentence in one style into anot...
research
08/28/2023

CoVR: Learning Composed Video Retrieval from Web Video Captions

Composed Image Retrieval (CoIR) has recently gained popularity as a task...
research
11/17/2021

Induce, Edit, Retrieve: Language Grounded Multimodal Schema for Instructional Video Retrieval

Schemata are structured representations of complex tasks that can aid ar...
research
11/04/2020

Graph Based Temporal Aggregation for Video Retrieval

Large scale video retrieval is a field of study with a lot of ongoing re...
research
03/14/2022

MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization

In this work we present a new State-of-The-Art on the text-to-video retr...
research
02/24/2021

A Straightforward Framework For Video Retrieval Using CLIP

Video Retrieval is a challenging task where a text query is matched to a...
research
07/25/2021

Transcript to Video: Efficient Clip Sequencing from Texts

Among numerous videos shared on the web, well-edited ones always attract...
