Frozen CLIP Models are Efficient Video Learners

08/06/2022
by Ziyi Lin, et al.

Video recognition has been dominated by the end-to-end learning paradigm: first initializing a video recognition model with the weights of a pretrained image model, then conducting end-to-end training on videos. This lets the video network benefit from the pretrained image model, but it requires substantial computation and memory for finetuning on videos, and the alternative of directly using pretrained image features without finetuning the image backbone leads to subpar results. Fortunately, recent advances in Contrastive Language-Image Pre-training (CLIP) pave the way for a new route to visual recognition tasks. Pretrained on large-scale open-vocabulary image-text pair data, these models learn powerful visual representations with rich semantics. In this paper, we present Efficient Video Learning (EVL), an efficient framework for directly training high-quality video recognition models on frozen CLIP features. Specifically, we employ a lightweight Transformer decoder and learn a query token that dynamically collects frame-level spatial features from the CLIP image encoder. Furthermore, we adopt a local temporal module in each decoder layer to discover temporal clues from adjacent frames and their attention maps. We show that, despite being efficient to train with a frozen backbone, our models learn high-quality video representations on a variety of video recognition datasets. Code is available at https://github.com/OpenGVLab/efficient-video-recognition.
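To make the architecture concrete, below is a minimal PyTorch sketch of the idea, not the authors' implementation (see the linked repository for the reference code). All module names and hyperparameters here (`EVLHead`, `depth=4`, the depthwise temporal convolution standing in for the local temporal module) are illustrative assumptions; the paper's temporal module additionally exploits attention maps from adjacent frames, which is omitted here. The key point the sketch captures is that the CLIP encoder stays frozen and only a small decoder head with a learned query token is trained.

```python
# Minimal sketch of the EVL design described above -- illustrative only,
# not the authors' code. Names and hyperparameters are assumptions.
import torch
import torch.nn as nn

class LocalTemporalConv(nn.Module):
    """Depthwise 1-D convolution over the time axis: one plausible form
    of a 'local temporal module' mixing adjacent-frame features."""
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size,
                              padding=kernel_size // 2, groups=dim)

    def forward(self, x):              # x: (B, T, N, D) per-frame features
        B, T, N, D = x.shape
        x = x.permute(0, 2, 3, 1).reshape(B * N, D, T)
        x = self.conv(x)               # mix each channel across nearby frames
        return x.reshape(B, N, D, T).permute(0, 3, 1, 2)

class EVLDecoderLayer(nn.Module):
    """One decoder layer: a learned query cross-attends to temporally
    enriched, frozen frame-level CLIP features."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.temporal = LocalTemporalConv(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, query, feats):   # query: (B, 1, D); feats: (B, T, N, D)
        feats = feats + self.temporal(feats)          # local temporal clues
        B, T, N, D = feats.shape
        kv = self.norm_kv(feats.reshape(B, T * N, D))
        query = query + self.attn(self.norm_q(query), kv, kv)[0]
        return query + self.mlp(query), feats

class EVLHead(nn.Module):
    """Lightweight trainable head on top of a frozen CLIP image encoder."""
    def __init__(self, dim=768, depth=4, num_classes=400):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(1, 1, dim))   # learned query token
        self.layers = nn.ModuleList(EVLDecoderLayer(dim) for _ in range(depth))
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, clip_feats):     # (B, T, N, D) from the frozen encoder
        q = self.query.expand(clip_feats.size(0), -1, -1)
        for layer in self.layers:
            q, clip_feats = layer(q, clip_feats)
        return self.fc(q.squeeze(1))

# Usage: extract per-frame patch tokens with a frozen CLIP image encoder
# (under torch.no_grad()), then train only this head.
head = EVLHead()
logits = head(torch.randn(2, 8, 196, 768))   # batch=2, 8 frames, 14x14 patches
print(logits.shape)                          # torch.Size([2, 400])
```

Because gradients never flow into the CLIP backbone, per-frame features can even be precomputed once and cached, which is what makes training the head cheap in both compute and memory.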


Related research

UnLoc: A Unified Framework for Video Localization Tasks (08/21/2023)
While large-scale image-text pretrained models such as CLIP have been us...

Unsupervised Contrastive Learning of Image Representations from Ultrasound Videos with Hard Negative Mining (07/26/2022)
Rich temporal information and variations in viewpoints make video data a...

PGT: A Progressive Method for Training Models on Long Videos (03/21/2021)
Convolutional video models have an order of magnitude larger computation...

Streaming Video Model (03/30/2023)
Video understanding tasks have traditionally been modeled by two separat...

Contrastive Losses Are Natural Criteria for Unsupervised Video Summarization (11/18/2022)
Video summarization aims to select the most informative subset of frames...

Glance and Focus Networks for Dynamic Visual Recognition (01/09/2022)
Spatial redundancy widely exists in visual recognition tasks, i.e., disc...

AdaFocus V2: End-to-End Training of Spatial Dynamic Networks for Video Recognition (12/28/2021)
Recent works have shown that the computational efficiency of video recog...
