Spatio-Temporal Crop Aggregation for Video Representation Learning

11/30/2022
by Sepehr Sameni, et al.

We propose Spatio-temporal Crop Aggregation for video representation LEarning (SCALE), a novel method that scales well at both training and inference time. Our model builds long-range video features by learning from sets of video clip-level features extracted with a pre-trained backbone. To train the model, we propose a self-supervised objective based on masked clip feature prediction. We apply sparsity both to the input, by extracting a random set of video clips, and to the loss function, by reconstructing only the sparse inputs. Moreover, we reduce dimensionality by working in the latent space of a pre-trained backbone applied to single video clips. The video representation is then obtained by taking the ensemble of the concatenation of embeddings of separate video clips with a video clip set summarization token. These techniques make our method not only extremely efficient to train, but also highly effective in transfer learning. We demonstrate that our video representation yields state-of-the-art performance with linear, non-linear, and k-NN probing on common action classification datasets.
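The sparse sampling and masked clip feature prediction described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the zero-vector mask token, the 25% mask ratio, the mean-squared-error loss, and the random stand-in features are all assumptions made for illustration, and a placeholder replaces the actual model.

```python
import random


def sample_clip_indices(num_clips, num_sampled, rng):
    """Pick a sparse random subset of clip positions from a video (sorted)."""
    return sorted(rng.sample(range(num_clips), num_sampled))


def mask_clip_features(features, mask_ratio, rng):
    """Replace a random subset of clip features with a zero 'mask token'.

    Returns the masked copy and the list of masked positions.
    The zero vector is an assumed stand-in for a learned mask token.
    """
    num_masked = max(1, int(round(mask_ratio * len(features))))
    masked_positions = sorted(rng.sample(range(len(features)), num_masked))
    dim = len(features[0])
    masked = [list(f) for f in features]
    for p in masked_positions:
        masked[p] = [0.0] * dim
    return masked, masked_positions


def masked_mse(predictions, targets, masked_positions):
    """Mean squared error computed only at the masked positions."""
    total, count = 0.0, 0
    for p in masked_positions:
        for a, b in zip(predictions[p], targets[p]):
            total += (a - b) ** 2
            count += 1
    return total / count


rng = random.Random(0)

# Assumed stand-in for frozen backbone outputs: 32 clips, 4-dim features each.
all_features = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(32)]

# Sparse input: only 8 of the 32 clips are used.
clip_ids = sample_clip_indices(32, 8, rng)
targets = [all_features[i] for i in clip_ids]

# Mask a quarter of the sampled clip features.
inputs, masked_positions = mask_clip_features(targets, 0.25, rng)

# Placeholder "model": returns its input unchanged. A real model would
# predict the masked features from the visible ones.
predictions = inputs

# Sparse loss: only the masked clip features contribute.
loss = masked_mse(predictions, targets, masked_positions)
```

Because both the targets and the inputs live in the backbone's feature space, a step like this touches only short feature vectors rather than raw frames, which is the source of the efficiency the abstract claims.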

research
09/10/2019

Video Representation Learning by Dense Predictive Coding

The objective of this paper is self-supervised learning of spatio-tempor...
research
06/20/2020

Video Playback Rate Perception for Self-supervised Spatio-Temporal Representation Learning

In self-supervised spatio-temporal representation learning, the temporal...
research
08/13/2020

Self-supervised Video Representation Learning by Pace Prediction

This paper addresses the problem of self-supervised video representation...
research
08/24/2017

Relaxed Spatio-Temporal Deep Feature Aggregation for Real-Fake Expression Prediction

Frame-level visual features are generally aggregated in time with the te...
research
10/10/2022

Turbo Training with Token Dropout

The objective of this paper is an efficient training method for video ta...
research
11/26/2020

t-EVA: Time-Efficient t-SNE Video Annotation

Video understanding has received more attention in the past few years du...
research
05/31/2019

3DPalsyNet: A Facial Palsy Grading and Motion Recognition Framework using Fully 3D Convolutional Neural Networks

The capability to perform facial analysis from video sequences has signi...
