Language-based Action Concept Spaces Improve Video Self-Supervised Learning

07/20/2023
by Kanchana Ranasinghe, et al.

Recent contrastive language-image pre-training has led to highly transferable and robust image representations. However, adapting these models to video domains with minimal supervision remains an open problem. We explore a simple step in that direction, using language-tied self-supervised learning to adapt an image CLIP model to the video domain. A backbone modified for temporal modeling is trained under self-distillation settings with training objectives that operate in an action concept space. This space is constructed from feature vectors of various action concepts, extracted from a language encoder using relevant textual prompts. We introduce two training objectives, concept distillation and concept alignment, that retain the generality of the original representations while enforcing relations between actions and their attributes. Our approach improves zero-shot and linear-probing performance on three action recognition benchmarks.
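The following is a minimal PyTorch sketch of how an action concept space and the two objectives described above could be set up. It is an illustration under stated assumptions, not the authors' implementation: the stand-in encoders, the CLIP-style temperature, and the clip-to-concept pairing used in `concept_alignment_loss` are hypothetical, and in practice the teacher features would come from an EMA copy of the backbone.

```python
# Sketch of concept-space objectives for adapting a CLIP-style model to video.
# Assumptions: a CLIP-style text encoder/tokenizer, cosine-similarity logits with
# a temperature, and per-clip concept/attribute indices for the alignment term.

import torch
import torch.nn.functional as F


def build_concept_space(text_encoder, tokenizer, concept_prompts):
    """Encode textual prompts for each action concept into an L2-normalised
    concept bank of shape (num_concepts, dim)."""
    with torch.no_grad():
        tokens = tokenizer(concept_prompts)      # assumed CLIP-style tokenizer
        concepts = text_encoder(tokens)          # (num_concepts, dim)
    return F.normalize(concepts, dim=-1)


def concept_scores(video_features, concept_bank, temperature=0.07):
    """Project video features into the concept space: cosine similarity to
    every concept vector, scaled by a temperature (CLIP-style logits)."""
    video_features = F.normalize(video_features, dim=-1)
    return video_features @ concept_bank.t() / temperature


def concept_distillation_loss(student_feats, teacher_feats, concept_bank):
    """Self-distillation in concept space: the student matches the frozen
    teacher's soft distribution over action concepts."""
    s = F.log_softmax(concept_scores(student_feats, concept_bank), dim=-1)
    with torch.no_grad():
        t = F.softmax(concept_scores(teacher_feats, concept_bank), dim=-1)
    return F.kl_div(s, t, reduction="batchmean")


def concept_alignment_loss(student_feats, concept_bank, concept_ids):
    """Alignment between clips and related concepts/attributes: pull each
    clip's concept logits toward its paired concept (hypothetical pairing)."""
    logits = concept_scores(student_feats, concept_bank)
    return F.cross_entropy(logits, concept_ids)


if __name__ == "__main__":
    # Toy shapes only: 8 clips, 512-dim features, 400 action concepts.
    dim, num_concepts = 512, 400
    concept_bank = F.normalize(torch.randn(num_concepts, dim), dim=-1)
    student = torch.randn(8, dim)
    teacher = torch.randn(8, dim)            # EMA teacher features in practice
    concept_ids = torch.randint(0, num_concepts, (8,))

    loss = (concept_distillation_loss(student, teacher, concept_bank)
            + concept_alignment_loss(student, concept_bank, concept_ids))
    print(float(loss))
```

In a full training setup, per the abstract, the concept bank would be built once from prompted text embeddings of action and attribute names, and both losses would be applied to the temporally extended video backbone while the teacher branch is updated as a moving average of the student.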

