Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning

09/14/2023
by Zhiwu Qing, et al.

Recently, large-scale pre-trained language-image models like CLIP have shown extraordinary capabilities for understanding spatial content, but naively transferring such models to video recognition still suffers from weak temporal modeling. Existing methods insert tunable structures into or in parallel with the pre-trained model, which either requires back-propagation through the whole pre-trained model and is thus resource-demanding, or is limited by the temporal reasoning capability of the pre-trained structure. In this work, we present DiST, which disentangles the learning of the spatial and temporal aspects of videos. Specifically, DiST uses a dual-encoder structure, where a pre-trained foundation model acts as the spatial encoder and a lightweight network is introduced as the temporal encoder. An integration branch is inserted between the two encoders to fuse spatio-temporal information. The disentangled spatial and temporal learning in DiST is highly efficient because it avoids back-propagation through the massive pre-trained parameters. Moreover, we empirically show that disentangled learning with an extra integration network benefits both spatial and temporal understanding. Extensive experiments on five benchmarks show that DiST outperforms existing state-of-the-art methods by convincing margins. When pre-trained on the large-scale Kinetics-710, DiST achieves 89.7% top-1 accuracy on Kinetics-400. Code and models are available at https://github.com/alibaba-mmai-research/DiST.
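To make the disentangled design concrete, below is a minimal PyTorch sketch of the idea described in the abstract: a frozen pre-trained spatial encoder, a lightweight temporal encoder, and an integration branch that fuses the two streams. All class names, layer choices, and dimensions here are illustrative assumptions rather than the authors' implementation; the actual code is in the linked repository.

```python
# Minimal sketch of a dual-encoder, disentangled spatio-temporal design.
# Hypothetical module names and layers; not the official DiST implementation.
import torch
import torch.nn as nn


class DiSTSketch(nn.Module):
    def __init__(self, spatial_encoder: nn.Module, embed_dim: int = 768,
                 temporal_dim: int = 128, num_classes: int = 400):
        super().__init__()
        # Pre-trained foundation model (e.g. a CLIP visual backbone) used as
        # the spatial encoder. It is frozen, so no gradients flow through it.
        self.spatial_encoder = spatial_encoder
        for p in self.spatial_encoder.parameters():
            p.requires_grad = False

        # Lightweight temporal encoder: a simple temporal 1D convolution over
        # per-frame features (a stand-in for the paper's temporal network).
        self.temporal_encoder = nn.Sequential(
            nn.Conv1d(embed_dim, temporal_dim, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv1d(temporal_dim, temporal_dim, kernel_size=3, padding=1),
        )

        # Integration branch fusing spatial and temporal features, plus head.
        self.integration = nn.Linear(embed_dim + temporal_dim, embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, C, H, W); the spatial encoder maps each frame to (D,).
        b, t = frames.shape[:2]
        with torch.no_grad():  # no back-propagation through the frozen encoder
            feats = self.spatial_encoder(frames.flatten(0, 1))  # (B*T, D)
        feats = feats.view(b, t, -1)                            # (B, T, D)

        # Temporal reasoning over the sequence of frame features.
        temporal = self.temporal_encoder(feats.transpose(1, 2)).transpose(1, 2)

        # Fuse the two streams, pool over time, and classify.
        fused = self.integration(torch.cat([feats, temporal], dim=-1))
        return self.head(fused.mean(dim=1))
```

In this sketch only the temporal encoder, integration branch, and classification head receive gradients, which is what makes the transfer efficient: the massive pre-trained spatial encoder is used purely as a feature extractor.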

