ICSVR: Investigating Compositional and Semantic Understanding in Video Retrieval Models

06/28/2023
by   Avinash Madasu, et al.

Video retrieval (VR) involves retrieving the ground-truth video from a video database given a text caption, or vice versa. The two important components of compositionality, objects & attributes and actions, are joined using correct semantics to form a proper text query. These components (objects & attributes, actions, and semantics) each play an important role in distinguishing among videos and retrieving the correct ground-truth video. However, the effect of these components on video retrieval performance is unclear. We therefore conduct a systematic study to evaluate the compositional and semantic understanding of video retrieval models on standard benchmarks such as MSRVTT, MSVD and DIDEMO. The study is performed on two categories of video retrieval models: (i) those pre-trained on video-text pairs and fine-tuned on downstream video retrieval datasets (e.g., Frozen-in-Time, Violet, MCQ) and (ii) those that adapt pre-trained image-text representations such as CLIP for video retrieval (e.g., CLIP4Clip, XCLIP, CLIP2Video). Our experiments reveal that actions and semantics play a minor role compared to objects & attributes in video understanding. Moreover, video retrieval models that use pre-trained image-text representations (CLIP) have better semantic and compositional understanding than models pre-trained on video-text data.
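To make the retrieval task concrete, here is a minimal sketch of text-to-video retrieval as nearest-neighbour search over embeddings. It is not taken from the paper: the toy embedding vectors, the function names (`cosine`, `rank_videos`, `recall_at_k`), and the use of Recall@K as the score are all assumptions chosen for illustration. A real system would obtain the embeddings from a trained text encoder and video encoder.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_videos(text_emb, video_embs):
    """Return video indices sorted from most to least similar to the caption."""
    scores = [cosine(text_emb, v) for v in video_embs]
    return sorted(range(len(video_embs)), key=lambda i: scores[i], reverse=True)

def recall_at_k(text_embs, video_embs, k):
    """Fraction of captions whose ground-truth video is in the top k.

    Assumption: caption i is paired with video i.
    """
    hits = sum(
        1
        for i, text_emb in enumerate(text_embs)
        if i in rank_videos(text_emb, video_embs)[:k]
    )
    return hits / len(text_embs)

# Hypothetical 3-dimensional embeddings, made up for this example.
video_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
text_embs = [[0.9, 0.1, 0.0], [0.6, 0.5, 0.0], [0.0, 0.2, 0.8]]

r1 = recall_at_k(text_embs, video_embs, k=1)
r2 = recall_at_k(text_embs, video_embs, k=2)
print(f"Recall@1 = {r1:.3f}, Recall@2 = {r2:.3f}")
```

In this toy setup the second caption sits closer to the first video than to its own, so it counts as a miss at K=1 and a hit at K=2. A study of the kind described above could, under the same assumptions, compare such scores between original captions and captions with a component altered.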
