TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale

05/23/2023
by Ziyun Zeng, et al.

The ultimate goal for foundation models is to be task-agnostic, i.e., to support out-of-the-box usage without task-specific fine-tuning. Although breakthroughs have been made in natural language processing and image representation learning, it remains challenging for video models to reach this goal due to the greater uncertainty of spatiotemporal signals. To ease training, existing works leverage the prior knowledge of image foundation models and equip them with efficient temporal modules. Despite satisfactory fine-tuning performance, we empirically find that they fall short of out-of-the-box usage: their performance under zero-shot/linear protocols is even degraded compared to their baseline counterparts. In this work, we analyze the factor behind this degradation from the perspective of language supervision distortion. We argue that tuning the text encoder end-to-end, as done in previous work, is suboptimal, since it may overfit to specific styles and thereby lose its original ability to capture the semantics of diverse language registers. The overfitted text encoder, in turn, provides a harmful supervision signal that degrades the video representation. To tackle this issue, we propose a degradation-free pre-training strategy that retains the generalization ability of the text encoder by freezing its shallow layers while keeping the deep layers tunable to capture task-related semantics. For the training objective, we adopt the transcript sorting task from TVTS, combined with masking techniques, to enable scalable training. As a result, we produce a series of models, dubbed TVTSv2, with up to one billion parameters. We achieve new state-of-the-art results on various video benchmarks with a frozen backbone, surpassing the recent ImageBind, InternVideo, and others. Code is available at https://github.com/TencentARC/TVTS.
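To make the partial-freezing idea concrete, below is a minimal PyTorch sketch, not the authors' released code. It freezes the embeddings and the first few transformer blocks of a CLIP-style text encoder while leaving the deeper blocks tunable. The use of HuggingFace's CLIPTextModel, the split point num_frozen=6, and the learning rate are illustrative assumptions, not the paper's reported configuration.

```python
import torch
from transformers import CLIPTextModel


def freeze_shallow_text_layers(text_encoder: CLIPTextModel, num_frozen: int):
    """Freeze the token/position embeddings and the first `num_frozen`
    transformer blocks; deeper blocks remain trainable."""
    # Embeddings are part of the "shallow" portion that preserves
    # general language ability.
    for p in text_encoder.text_model.embeddings.parameters():
        p.requires_grad = False
    # Freeze the first `num_frozen` encoder blocks.
    for layer in text_encoder.text_model.encoder.layers[:num_frozen]:
        for p in layer.parameters():
            p.requires_grad = False
    return text_encoder


text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
# CLIP ViT-B/32's text tower has 12 blocks; freezing the first 6 is an
# illustrative choice, not the paper's setting.
text_encoder = freeze_shallow_text_layers(text_encoder, num_frozen=6)

# Pass only the still-trainable parameters to the optimizer.
optimizer = torch.optim.AdamW(
    (p for p in text_encoder.parameters() if p.requires_grad), lr=1e-5
)
```

The design intuition is that the frozen shallow layers anchor the encoder's generic linguistic representations, so only the task-related semantics in the tunable deep layers adapt to the noisy transcript supervision.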


