Use What You Have: Video Retrieval Using Representations From Collaborative Experts

07/31/2019
by Yang Liu, et al.

The rapid growth of video on the internet has made searching for video content using natural language queries a significant challenge. Human-generated queries for video datasets 'in the wild' vary widely in their degree of specificity, with some queries describing 'specific details' such as the names of famous identities, content from speech, or text visible on screen. Our goal is to condense the multi-modal, extremely high-dimensional information from videos into a single, compact video representation for the task of video retrieval using free-form text queries, where the degree of specificity is open-ended. To this end, we exploit existing knowledge in the form of pretrained semantic embeddings. These include 'general' features such as motion, appearance, and scene features from visual content, and more 'specific' cues from ASR and OCR, which may not always be available but allow for more fine-grained disambiguation when present. We propose a collaborative experts model to aggregate information effectively from these different pretrained experts. The effectiveness of our approach is demonstrated empirically, setting new state-of-the-art performance on five retrieval benchmarks (MSR-VTT, LSMDC, MSVD, DiDeMo, and ActivityNet) while using fewer parameters than prior work. Code and data can be found at www.robots.ox.ac.uk/~vgg/research/collaborative-experts/.
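To make the idea of one compact video representation built from several pretrained experts more concrete, here is a minimal Python sketch. It is not the collaborative experts model described in the abstract: it simply averages whichever expert embeddings are present and ranks videos by cosine similarity to a query vector. The function names, the toy three-dimensional vectors, and the assumption that every expert has already been projected into a space shared with the text query are all hypothetical and chosen only for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse_experts(expert_embeddings):
    """Fuse per-expert embeddings into a single unit-length video vector.

    `expert_embeddings` maps an expert name (e.g. "appearance", "ocr") to a
    vector, or to None when that expert is unavailable for the video.
    Assumption: all vectors share one dimensionality. Plain averaging is a
    stand-in for the learned aggregation the abstract refers to.
    """
    available = [l2_normalize(v) for v in expert_embeddings.values() if v is not None]
    if not available:
        raise ValueError("at least one expert embedding is required")
    dim = len(available[0])
    fused = [sum(v[i] for v in available) / len(available) for i in range(dim)]
    return l2_normalize(fused)


def rank_videos(query_vec, videos):
    """Return video names sorted by cosine similarity to the query, best first."""
    q = l2_normalize(query_vec)
    scores = {}
    for name, experts in videos.items():
        v = fuse_experts(experts)
        # Both vectors are unit length, so the dot product is the cosine.
        scores[name] = sum(a * b for a, b in zip(q, v))
    return sorted(scores, key=scores.get, reverse=True)


# Toy data: the "ocr" expert is missing for the first video.
videos = {
    "cooking": {"appearance": [1.0, 0.0, 0.0], "motion": [0.8, 0.2, 0.0], "ocr": None},
    "football": {"appearance": [0.0, 1.0, 0.0], "motion": [0.1, 0.9, 0.0], "ocr": [0.0, 0.8, 0.2]},
}
query = [0.9, 0.1, 0.0]  # stands in for an embedded text query
ranking = rank_videos(query, videos)
print(ranking)
```

Skipping absent experts rather than zero-filling them keeps a missing ASR or OCR signal from pulling the fused vector toward the origin, which is one simple way to handle cues that "may not always be available".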


research
07/21/2020

Multi-modal Transformer for Video Retrieval

The task of retrieving video content relevant to natural language querie...
research
04/13/2018

Text-to-Clip Video Retrieval with Early Fusion and Re-Captioning

We propose a novel method capable of retrieving clips from untrimmed vid...
research
01/30/2023

Tagging before Alignment: Integrating Multi-Modal Tags for Video-Text Retrieval

Vision-language alignment learning for video-text retrieval arouses a lo...
research
08/22/2023

Multi-event Video-Text Retrieval

Video-Text Retrieval (VTR) is a crucial multi-modal task in an era of ma...
research
04/07/2018

Learning a Text-Video Embedding from Incomplete and Heterogeneous Data

Joint understanding of video and language is an active research area wit...
research
05/08/2023

Joint Moment Retrieval and Highlight Detection Via Natural Language Queries

Video summarization has become an increasingly important task in the fie...
research
06/07/2023

MarineVRS: Marine Video Retrieval System with Explainability via Semantic Understanding

Building a video retrieval system that is robust and reliable, especiall...
