Enhancing Self-supervised Video Representation Learning via Multi-level Feature Optimization

08/04/2021
by Rui Qian, et al.

The crux of self-supervised video representation learning is to build general features from unlabeled videos. However, most recent works have focused mainly on high-level semantics and neglected lower-level representations and their temporal relationships, which are crucial for general video understanding. To address these limitations, this paper proposes a multi-level feature optimization framework that improves the generalization and temporal modeling ability of learned video representations. Concretely, high-level features obtained from naive and prototypical contrastive learning are used to build distribution graphs, which guide low-level and mid-level feature learning. We also devise a simple temporal modeling module that operates on multi-level features to enhance motion pattern learning. Experiments demonstrate that multi-level feature optimization with the graph constraint and temporal modeling can greatly improve the representation ability in video understanding.
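
The abstract does not give the loss formulations, so the following is only a minimal NumPy sketch of the two ideas it names: an instance-level ("naive") contrastive loss on high-level features, and a graph constraint in which the similarity distribution among high-level features serves as a target for lower-level features. The function names, the temperature value, and the use of cross-entropy between neighbor distributions are all assumptions made for illustration, not the paper's actual method. Prototypical contrastive learning and the temporal modeling module are not covered.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits):
    # Numerically stable row-wise log-softmax.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def info_nce(query, key, temperature=0.1):
    # Instance-level contrastive loss (InfoNCE form): row i of `query` should
    # match row i of `key`; all other rows in the batch act as negatives.
    logits = l2_normalize(query) @ l2_normalize(key).T / temperature
    return -np.mean(np.diag(log_softmax(logits)))

def off_diagonal_logits(features, temperature):
    # Pairwise similarities between samples with self-similarity removed,
    # returned as an (N, N - 1) array: one row of graph edges per sample.
    f = l2_normalize(features)
    sim = f @ f.T / temperature
    n = sim.shape[0]
    return sim[~np.eye(n, dtype=bool)].reshape(n, n - 1)

def graph_constraint(low_level, high_level, temperature=0.1):
    # Assumed form of the constraint: cross-entropy between the neighbor
    # distribution of high-level features (the target, which would be treated
    # as a constant during training) and that of low-level features.
    target = np.exp(log_softmax(off_diagonal_logits(high_level, temperature)))
    log_pred = log_softmax(off_diagonal_logits(low_level, temperature))
    return -np.mean(np.sum(target * log_pred, axis=1))

# Toy batch: 8 clips with 16-d high-level and 4-d low-level features.
rng = np.random.default_rng(0)
high = rng.normal(size=(8, 16))
low = rng.normal(size=(8, 4))
contrastive_loss = info_nce(high, high + 0.05 * rng.normal(size=high.shape))
graph_loss = graph_constraint(low, high)
```

Because the constraint compares only sample-to-sample similarity distributions, the low-level and high-level features in this sketch may have different dimensionalities.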
