CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment

09/14/2022
by Hongwei Xue, et al.

Pre-trained image-text models such as CLIP have demonstrated the strong power of vision-language representations learned from large-scale web-collected image-text data. In light of these well-learned visual features, some existing works transfer image representations to the video domain and achieve good results. However, how to utilize an image-language pre-trained model (e.g., CLIP) for video-language pre-training (post-pretraining) is still underexplored. In this paper, we investigate two questions: 1) what factors hinder post-pretraining CLIP from further improving performance on video-language tasks? and 2) how can the impact of these factors be mitigated? Through a series of comparative experiments and analyses, we find that the data scale and the domain gap between language sources have a great impact. Motivated by these findings, we propose an Omnisource Cross-modal Learning method equipped with a Video Proxy mechanism on the basis of CLIP, namely CLIP-ViP. Extensive results show that our approach improves the performance of CLIP on video-text retrieval by a large margin. Our model also achieves SOTA results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet. We release our code and pre-trained CLIP-ViP models at https://github.com/microsoft/XPretrain/tree/main/CLIP-ViP.
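To make the Video Proxy idea concrete, below is a minimal PyTorch sketch, not the authors' implementation (see the linked repository for that): a small set of learnable proxy tokens is prepended to the patch tokens of all sampled frames, so a CLIP-style image encoder can aggregate temporal information with little architectural change. The module name VideoProxyEmbedding and parameters such as num_proxies are hypothetical.

import torch
import torch.nn as nn

class VideoProxyEmbedding(nn.Module):
    """Illustrative sketch: embed video frames as ViT patch tokens and
    prepend learnable video proxy tokens that can aggregate information
    across frames (assumed interface, not the official CLIP-ViP code)."""

    def __init__(self, embed_dim=768, patch_size=16, num_proxies=4, max_frames=12):
        super().__init__()
        # Standard ViT-style patch embedding applied to every frame.
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
        # Learnable proxy tokens shared across all videos.
        self.proxy_tokens = nn.Parameter(torch.zeros(num_proxies, embed_dim))
        # Per-frame temporal position embedding.
        self.temporal_embed = nn.Parameter(torch.zeros(max_frames, embed_dim))
        nn.init.trunc_normal_(self.proxy_tokens, std=0.02)
        nn.init.trunc_normal_(self.temporal_embed, std=0.02)

    def forward(self, video):
        # video: (B, T, 3, H, W)
        b, t = video.shape[:2]
        x = self.patch_embed(video.flatten(0, 1))      # (B*T, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)               # (B*T, N, D)
        x = x.reshape(b, t, x.size(1), x.size(2))      # (B, T, N, D)
        x = x + self.temporal_embed[:t].unsqueeze(0).unsqueeze(2)
        x = x.flatten(1, 2)                            # (B, T*N, D)
        proxies = self.proxy_tokens.unsqueeze(0).expand(b, -1, -1)
        # Proxy tokens come first; downstream transformer blocks (e.g., the
        # proxy-guided attention described in the paper) decide how patch
        # and proxy tokens attend to each other.
        return torch.cat([proxies, x], dim=1)          # (B, P + T*N, D)

# Example usage (shapes only):
# tokens = VideoProxyEmbedding()(torch.randn(2, 8, 3, 224, 224))
# tokens.shape -> (2, 4 + 8*196, 768)

In the paper, such video token sequences feed the CLIP encoder, whose output is then aligned with text from multiple sources under the proposed Omnisource Cross-modal Learning objective.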

