Zero-shot Video Moment Retrieval With Off-the-Shelf Models

11/03/2022
by Anuj Diwan, et al.

For most of the machine learning community, two major bottlenecks stand in the way of building models for new tasks: the expense of collecting high-quality human-annotated data, and the inability to efficiently finetune very large state-of-the-art pretrained models on limited compute. We propose a simple zero-shot approach for one such task, Video Moment Retrieval (VMR), that performs no additional finetuning and simply repurposes off-the-shelf models trained on other tasks. Our three-step approach consists of moment proposal, moment-query matching, and postprocessing, all using only off-the-shelf models. On the QVHighlights benchmark for VMR, we improve on the performance of previous zero-shot approaches by at least 2.5x on all metrics and reduce the gap between zero-shot and state-of-the-art supervised approaches by over 74%. We also show that our zero-shot approach beats non-pretrained supervised models on the Recall metrics and comes very close on the mAP metrics, and that it performs better than the best pretrained supervised model on shorter moments. Finally, we ablate and analyze our results and propose future directions.
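The abstract names the three steps but does not say how each one is implemented, so the following is only a minimal sketch of how such a pipeline could be wired together, not the authors' method. It assumes that per-frame embeddings and a query embedding have already been produced by some off-the-shelf image-text model and live in the same vector space. The boundary heuristic, the mean-similarity score, the overlap filter, and all function names and thresholds are hypothetical choices made for illustration.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def propose_moments(frames, boundary_threshold=0.8):
    """Step 1 (assumed heuristic): cut the video wherever two consecutive
    frame embeddings are dissimilar. Returns (start, end) index pairs,
    with end exclusive."""
    if not frames:
        return []
    moments, start = [], 0
    for i in range(1, len(frames)):
        if cosine(frames[i - 1], frames[i]) < boundary_threshold:
            moments.append((start, i))
            start = i
    moments.append((start, len(frames)))
    return moments


def score_moments(frames, moments, query):
    """Step 2 (assumed scoring): score each moment by the mean similarity
    between its frame embeddings and the query embedding."""
    return [
        (s, e, sum(cosine(frames[i], query) for i in range(s, e)) / (e - s))
        for s, e in moments
    ]


def temporal_iou(a, b):
    """Intersection over union of two (start, end, ...) spans."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0


def postprocess(scored, iou_threshold=0.5):
    """Step 3 (assumed postprocessing): rank by score and drop any moment
    that overlaps a higher-scored one. With the non-overlapping proposals
    above this only ranks; the filter matters for overlapping proposals."""
    kept = []
    for cand in sorted(scored, key=lambda m: m[2], reverse=True):
        if all(temporal_iou(cand, k) <= iou_threshold for k in kept):
            kept.append(cand)
    return kept


def retrieve(frames, query):
    """Run the three steps and return ranked (start, end, score) moments."""
    return postprocess(score_moments(frames, propose_moments(frames), query))
```

Calling `retrieve(frames, query)` returns moments sorted from best to worst match. The `temporal_iou` helper is the same kind of overlap measure that Recall and mAP metrics for moment retrieval are typically thresholded on, which is why it is kept as a separate function.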

research
03/24/2022

FitCLIP: Refining Large-Scale Pretrained Image-Text Models for Zero-Shot Video Understanding Tasks

Large-scale pretrained image-text models have shown incredible zero-shot...
research
03/11/2022

LaPraDoR: Unsupervised Pretrained Dense Retriever for Zero-Shot Text Retrieval

In this paper, we propose LaPraDoR, a pretrained dual-tower dense retrie...
research
04/12/2022

ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension

Training a referring expression comprehension (ReC) model for a new visu...
research
07/23/2023

Geometry-Aware Adaptation for Pretrained Models

Machine learning models – including prominent zero-shot models – are oft...
research
04/06/2023

Vita-CLIP: Video and text adaptive CLIP via Multimodal Prompting

Adopting contrastive image-text pretrained models like CLIP towards vide...
research
05/13/2023

Zero-shot Faithful Factual Error Correction

Faithfully correcting factual errors is critical for maintaining the int...
research
06/13/2023

GeneCIS: A Benchmark for General Conditional Image Similarity

We argue that there are many notions of 'similarity' and that models, li...
