Fighting FIRe with FIRE: Assessing the Validity of Text-to-Video Retrieval Benchmarks

10/10/2022
by   Pedro Rodriguez, et al.

Searching vast troves of videos with textual descriptions is a core multimodal retrieval task. Owing to the lack of a purpose-built dataset for text-to-video retrieval, video captioning datasets have been re-purposed to evaluate models by (1) treating captions as positive matches to their original videos and (2) treating all other videos as negatives. However, this methodology leads to a fundamental flaw during evaluation: since captions are marked as relevant only to their original video, the many alternate videos that also match a caption are wrongly counted as non-relevant, creating false-negative caption-video pairs. We show that when these false negatives are corrected, a recent state-of-the-art model gains 25% recall points, a difference that threatens the validity of the benchmark itself. To diagnose and mitigate this issue, we annotate and release 683K additional caption-video pairs. Using these, we recompute effectiveness scores for three models on two standard benchmarks (MSR-VTT and MSVD). We find that (1) the recomputed metrics are up to 25% recall points higher for the best models, (2) these benchmarks are nearing saturation for Recall@10, (3) caption length (generality) is related to the number of positives, and (4) annotation costs can be mitigated by choosing an evaluation size that matches the effect size one wishes to detect. We recommend retiring these benchmarks in their current form and make recommendations for future text-to-video retrieval benchmarks.
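To make the evaluation flaw concrete, the sketch below contrasts Recall@K under the standard single-positive protocol with a corrected multi-positive ground truth. This is an illustration only, not the paper's released code; all identifiers and data (recall_at_k, cap_1, vid_7, and so on) are hypothetical.

```python
# Minimal sketch (illustrative, not the paper's code): Recall@K under the
# standard single-positive assumption vs. a corrected multi-positive ground
# truth that credits alternate matching videos as positives.

def recall_at_k(ranked_video_ids, positive_ids, k=10):
    """Fraction of captions whose top-k retrieved videos contain any positive."""
    hits = 0
    for caption_id, ranking in ranked_video_ids.items():
        if positive_ids[caption_id] & set(ranking[:k]):
            hits += 1
    return hits / len(ranked_video_ids)

# Standard protocol: each caption is relevant only to its source video.
single_positive = {"cap_1": {"vid_1"}, "cap_2": {"vid_2"}}

# Corrected protocol: additional annotated caption-video pairs mark the
# alternate videos that also match each caption (false negatives fixed).
multi_positive = {"cap_1": {"vid_1", "vid_7"},
                  "cap_2": {"vid_2", "vid_3", "vid_9"}}

# A model's ranked retrieval results per caption (hypothetical data).
rankings = {"cap_1": ["vid_7", "vid_4", "vid_1"],
            "cap_2": ["vid_3", "vid_2", "vid_5"]}

print(recall_at_k(rankings, single_positive, k=1))  # 0.0: both top-1 results scored as misses
print(recall_at_k(rankings, multi_positive, k=1))   # 1.0: same rankings, corrected labels
```

The rankings are identical in both calls; only the relevance labels change, which is why correcting false negatives can raise a model's measured recall without the model itself improving.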


research · 03/18/2021
On Semantic Similarity in Video Retrieval
Current video retrieval efforts all found their evaluation on an instanc...

research · 03/29/2023
Hierarchical Video-Moment Retrieval and Step-Captioning
There is growing interest in searching for information from large video ...

research · 03/16/2022
Learning video retrieval models with relevance-aware online mining
Due to the amount of videos and related captions uploaded every hour, de...

research · 01/20/2022
End-to-end Generative Pretraining for Multimodal Video Captioning
Recent video and language pretraining frameworks lack the ability to gen...

research · 04/01/2021
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
Our objective in this work is video-text retrieval - in particular a joi...

research · 03/09/2023
Improving Video Retrieval by Adaptive Margin
Video retrieval is becoming increasingly important owing to the rapid em...

research · 06/23/2016
VideoMCC: a New Benchmark for Video Comprehension
While there is overall agreement that future technology for organizing, ...
