Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring

01/26/2023
by   Ruyang Liu, et al.

Image-text pretrained models, e.g., CLIP, have shown impressive general multi-modal knowledge learned from large-scale image-text data pairs, thus attracting increasing attention for their potential to improve visual representation learning in the video domain. In this paper, based on the CLIP model, we revisit temporal modeling in the context of image-to-video knowledge transferring, which is the key point for extending image-text pretrained models to the video domain. We find that current temporal modeling mechanisms are tailored to either high-level semantic-dominant tasks (e.g., retrieval) or low-level visual pattern-dominant tasks (e.g., recognition), and fail to work on both simultaneously. The key difficulty lies in modeling temporal dependency while taking advantage of both the high-level and the low-level knowledge in the CLIP model. To tackle this problem, we present the Spatial-Temporal Auxiliary Network (STAN), a simple and effective temporal modeling mechanism that extends the CLIP model to diverse video tasks. Specifically, to transfer both low-level and high-level knowledge, STAN adopts a branch structure with decomposed spatial-temporal modules that enable multi-level CLIP features to be spatial-temporally contextualized. We evaluate our method on two representative video tasks: Video-Text Retrieval and Video Recognition. Extensive experiments demonstrate the superiority of our model over state-of-the-art methods on various datasets, including MSR-VTT, DiDeMo, LSMDC, MSVD, Kinetics-400, and Something-Something-V2. Code will be available at https://github.com/farewellthree/STAN
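
To make the idea of a branch with decomposed spatial-temporal modules more concrete, the following is a minimal NumPy sketch. It is not the authors' implementation. The abstract states only that the branch consumes multi-level CLIP features and contextualizes them spatially and temporally; everything else here is an assumption made for illustration: the parameter-free attention, the residual connections, the spatial-then-temporal order, the way each level is added into the branch state, and the names `self_attention`, `spatial_temporal_block`, and `auxiliary_branch`. Random arrays stand in for real CLIP features.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x):
    # Parameter-free scaled dot-product self-attention (an illustrative
    # simplification: no learned query/key/value projections).
    # x: (..., L, D) -> (..., L, D); tokens along axis -2 attend to each other.
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ x


def spatial_temporal_block(x):
    # x: (T, N, D) = frames, patches per frame, channels.
    # Spatial step: patches attend to each other within the same frame.
    x = x + self_attention(x)
    # Temporal step: each patch position attends across frames.
    xt = np.swapaxes(x, 0, 1)          # (N, T, D)
    xt = xt + self_attention(xt)
    return np.swapaxes(xt, 0, 1)       # back to (T, N, D)


def auxiliary_branch(level_feats):
    # level_feats: list of (T, N, D) arrays, one per backbone level
    # (hypothetical stand-ins for intermediate CLIP features).
    h = np.zeros_like(level_feats[0])
    for feats in level_feats:
        # Assumption: each level is simply added to the branch state
        # before being contextualized.
        h = spatial_temporal_block(h + feats)
    return h


rng = np.random.default_rng(0)
T, N, D = 4, 5, 8
levels = [rng.standard_normal((T, N, D)) for _ in range(3)]

out = auxiliary_branch(levels)
video_embedding = out.mean(axis=(0, 1))   # pooled (D,) video-level vector
```

Because the temporal step mixes information across frames, changing one frame's input changes the output at every other frame, which a frame-wise image encoder alone would not do.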


