In Defense of Image Pre-Training for Spatiotemporal Recognition

05/03/2022
by   Xianhang Li, et al.
Image pre-training, the current de-facto paradigm for a wide range of visual tasks, is generally less favored in the field of video recognition. By contrast, a common strategy is to train spatiotemporal convolutional neural networks (CNNs) directly from scratch. Interestingly, however, taking a closer look at these from-scratch learned CNNs, we note that certain 3D kernels exhibit much stronger appearance modeling ability than others, arguably suggesting that appearance information is already well disentangled during learning. Inspired by this observation, we hypothesize that the key to effectively leveraging image pre-training lies in decomposing the learning of spatial and temporal features, and revisiting image pre-training as the appearance prior for initializing 3D kernels. In addition, we propose Spatial-Temporal Separable (STS) convolution, which explicitly splits the feature channels into spatial and temporal groups, to enable a more thorough decomposition of spatiotemporal features when fine-tuning 3D CNNs. Our experiments show that simply replacing 3D convolution with STS notably improves a wide range of 3D CNNs, without increasing parameters or computation, on both Kinetics-400 and Something-Something V2. Moreover, this new training pipeline consistently achieves better results on video recognition with a significant speedup. For instance, we achieve a +0.6% gain over the strong 256-epoch 128-GPU baseline while fine-tuning for only 50 epochs with 4 GPUs. The code and models are available at https://github.com/UCSC-VLAA/Image-Pretraining-for-Video.
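The channel-splitting idea behind STS convolution can be sketched in a few lines. The snippet below is an illustrative toy, not the paper's implementation: it assumes a single-sample tensor of shape (C, T, H, W), splits the channels in half, and applies a spatial-only (1, kh, kw) kernel to one group and a temporal-only (kt, 1, 1) kernel to the other. Function names, the 50/50 split, and the shared per-group kernel are simplifying assumptions for clarity.

```python
import numpy as np

def conv3d_same(x, w):
    """Naive 'same'-padded 3D cross-correlation of one channel.

    x: array of shape (T, H, W); w: kernel of shape (kt, kh, kw),
    odd kernel sizes assumed so the output matches the input shape.
    """
    kt, kh, kw = w.shape
    pt, ph, pw = kt // 2, kh // 2, kw // 2
    xp = np.pad(x, ((pt, pt), (ph, ph), (pw, pw)))
    T, H, W = x.shape
    out = np.empty((T, H, W))
    for i in range(T):
        for j in range(H):
            for k in range(W):
                out[i, j, k] = np.sum(xp[i:i + kt, j:j + kh, k:k + kw] * w)
    return out

def sts_conv(x, w_spatial, w_temporal):
    """Toy STS convolution: the first half of the channels is convolved
    with a spatial-only kernel (1, kh, kw), the second half with a
    temporal-only kernel (kt, 1, 1); the groups are then re-stacked.

    x: array of shape (C, T, H, W).
    """
    C = x.shape[0]
    half = C // 2
    outs = [conv3d_same(x[c], w_spatial) for c in range(half)]
    outs += [conv3d_same(x[c], w_temporal) for c in range(half, C)]
    return np.stack(outs)
```

Because each group touches only one axis (space or time), the split keeps the parameter count of a factorized kernel while forcing an explicit division of labor across channels, which matches the decomposition the abstract argues is key to reusing image-pre-trained weights as an appearance prior.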

