VicTR: Video-conditioned Text Representations for Activity Recognition

04/05/2023
by Kumara Kahatapitiya, et al.

Vision-language models have shown strong performance in the image domain, even in zero-shot settings, thanks to the availability of large amounts of pretraining data (i.e., paired image-text examples). For videos, however, such paired data is not as abundant. Video-text models are therefore usually designed by adapting pretrained image-text models to the video domain, rather than training from scratch. All such recipes rely on augmenting the visual embeddings with temporal information (i.e., image → video), often keeping the text embeddings unchanged or even discarding them. In this paper, we argue that such adapted video-text models can benefit more from augmenting the text rather than the visual information. We propose VicTR, which jointly optimizes text and video tokens, generating 'Video-conditioned Text' embeddings. Our method can further make use of freely available semantic information in the form of visually grounded auxiliary text (e.g., object or scene information). We conduct experiments on multiple benchmarks, including supervised (Kinetics-400, Charades) and zero-shot/few-shot (HMDB-51, UCF-101) settings, showing competitive performance on activity recognition among video-text models.
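
To illustrate the general idea of conditioning text embeddings on video features, here is a minimal sketch, not the authors' code: the exact VicTR architecture is not specified in this abstract, so the cross-attention module, temporal mean pooling, and all dimensions below are illustrative assumptions on top of a generic CLIP-style backbone.

```python
# Minimal sketch (assumptions, not the VicTR implementation): text tokens
# attend to per-frame video tokens so that the text side, rather than only
# the visual side, is adapted to the video.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VideoConditionedText(nn.Module):
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        # Cross-attention: text prompts (queries) attend to frame tokens (keys/values).
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_emb, video_tokens):
        # text_emb:     (num_prompts, dim)  class prompts plus auxiliary prompts
        #               (e.g., object or scene descriptions).
        # video_tokens: (batch, num_frames, dim) per-frame image-encoder embeddings.
        B = video_tokens.size(0)
        q = text_emb.unsqueeze(0).expand(B, -1, -1)            # (B, P, dim)
        attended, _ = self.cross_attn(q, video_tokens, video_tokens)
        cond_text = self.norm(q + attended)                    # video-conditioned text
        # Video representation: simple temporal mean pooling (an assumption here).
        video_emb = video_tokens.mean(dim=1)                   # (B, dim)
        # Cosine-similarity logits between the video and each conditioned prompt.
        logits = torch.einsum(
            "bd,bpd->bp",
            F.normalize(video_emb, dim=-1),
            F.normalize(cond_text, dim=-1),
        )
        return logits


# Usage with random stand-ins for CLIP features.
model = VideoConditionedText(dim=512)
text = torch.randn(400, 512)        # e.g., Kinetics-400 class prompts
frames = torch.randn(2, 8, 512)     # 2 clips, 8 frames each
print(model(text, frames).shape)    # torch.Size([2, 400])
```

The key difference from typical image-to-video adaptations is visible in the sketch: the text embeddings are updated per video clip, rather than being kept frozen while only the visual pathway gains temporal modeling.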

Related research

06/13/2023 - Visual Language Pretrained Multiple Instance Zero-Shot Transfer for Histopathology Images
Contrastive visual language pretraining has emerged as a powerful method...

06/16/2023 - Vision-Language Models can Identify Distracted Driver Behavior from Naturalistic Videos
Recognizing the activities, causing distraction, in real-world driving s...

06/10/2023 - EventCLIP: Adapting CLIP for Event-based Object Recognition
Recent advances in 2D zero-shot and few-shot recognition often leverage ...

10/08/2019 - AutoML using Metadata Language Embeddings
As a human choosing a supervised learning algorithm, it is natural to be...

04/06/2023 - Vita-CLIP: Video and text adaptive CLIP via Multimodal Prompting
Adopting contrastive image-text pretrained models like CLIP towards vide...

01/05/2023 - Test of Time: Instilling Video-Language Models with a Sense of Time
Modeling and understanding time remains a challenge in contemporary vide...

05/23/2023 - Large Language Models are Frame-level Directors for Zero-shot Text-to-Video Generation
In the paradigm of AI-generated content (AIGC), there has been increasin...
