Learning video retrieval models with relevance-aware online mining

03/16/2022
by Alex Falcon, et al.

Due to the amount of videos and related captions uploaded every hour, deep learning-based solutions for cross-modal video retrieval are attracting more and more attention. A typical approach consists of learning a joint text-video embedding space in which the similarity between a video and its associated caption is maximized, whereas a lower similarity is enforced with all the other captions, called negatives. This approach assumes that only the video-caption pairs in the dataset are valid, yet other captions, the positives, may also describe a video's visual contents, so some of them may be wrongly penalized. To address this shortcoming, we propose Relevance-Aware Negatives and Positives mining (RANP), which, based on the semantics of the negatives, improves their selection while also increasing the similarity of other valid positives. We explore the influence of these techniques on two video-text datasets: EPIC-Kitchens-100 and MSR-VTT. By using the proposed techniques, we achieve considerable improvements in terms of nDCG and mAP, leading to state-of-the-art results, e.g. +5.3% nDCG on EPIC-Kitchens-100. We share code and pretrained models at <https://github.com/aranciokov/ranp>.
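To make the relevance-aware mining concrete, here is a minimal PyTorch sketch of one retrieval direction (video-to-text) of a triplet loss with online hard-negative mining, where semantically relevant captions are masked out of the negative pool and pulled closer as extra positives. The function name ranp_triplet_loss, the relevance threshold tau, and the relevance-weighted positive term are illustrative assumptions, not the authors' reference implementation (see the linked repository for that).

import torch
import torch.nn.functional as F

def ranp_triplet_loss(video_emb, text_emb, relevance, margin=0.2, tau=0.5):
    """video_emb, text_emb: (B, D) L2-normalized embeddings of paired videos
    and captions; relevance: (B, B) matrix where relevance[i, j] scores how
    well caption j describes video i (diagonal entries equal 1)."""
    sim = video_emb @ text_emb.t()         # (B, B) cosine similarities
    pos_sim = sim.diag().unsqueeze(1)      # similarity of ground-truth pairs

    # Relevance-aware negatives: captions that are semantically relevant to
    # a video are excluded from its negative pool before hard-negative mining.
    is_negative = relevance < tau
    neg_sim = sim.masked_fill(~is_negative, float('-inf'))
    hardest_neg = neg_sim.max(dim=1, keepdim=True).values
    loss_neg = F.relu(margin + hardest_neg - pos_sim).mean()

    # Relevance-aware positives: non-paired but relevant captions are pulled
    # closer as well, each weighted by its relevance score (an assumption of
    # this sketch).
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    pos_weight = relevance * ((relevance >= tau) & ~eye)
    loss_pos = (pos_weight * F.relu(margin + hardest_neg - sim)).sum()
    loss_pos = loss_pos / pos_weight.sum().clamp(min=1.0)

    return loss_neg + loss_pos

# Toy usage: 4 random unit embeddings and a hypothetical relevance matrix in
# which captions 0 and 1 describe similar visual content.
B, D = 4, 256
v = F.normalize(torch.randn(B, D), dim=1)
t = F.normalize(torch.randn(B, D), dim=1)
rel = torch.eye(B)
rel[0, 1] = rel[1, 0] = 0.8
print(ranp_triplet_loss(v, t, rel))

Masking by relevance is what prevents valid positives from being penalized as hard negatives; in the paper the relevance scores are derived from the semantics of the captions rather than fixed by hand as above.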


