CLIP2Video: Mastering Video-Text Retrieval via Image CLIP

06/21/2021
by   Han Fang, et al.
0

We present CLIP2Video network to transfer the image-language pre-training model to video-text retrieval in an end-to-end manner. Leading approaches in the domain of video-and-language learning try to distill the spatio-temporal video features and multi-modal interaction between videos and languages from a large-scale video-text dataset. Different from them, we leverage pretrained image-language model, simplify it as a two-stage framework with co-learning of image-text and enhancing temporal relations between video frames and video-text respectively, make it able to train on comparatively small datasets. Specifically, based on the spatial semantics captured by Contrastive Language-Image Pretraining (CLIP) model, our model involves a Temporal Difference Block to capture motions at fine temporal video frames, and a Temporal Alignment Block to re-align the tokens of video clips and phrases and enhance the multi-modal correlation. We conduct thorough ablation studies, and achieve state-of-the-art performance on major text-to-video and video-to-text retrieval benchmarks, including new records of retrieval accuracy on MSR-VTT, MSVD and VATEX.

READ FULL TEXT

page 2

page 8

research
04/18/2021

CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval

Video-text retrieval plays an essential role in multi-modal research and...
research
03/29/2021

ViViT: A Video Vision Transformer

We present pure-transformer based models for video classification, drawi...
research
06/05/2022

Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval

Multi-channel video-language retrieval require models to understand info...
research
08/23/2021

TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment

Contrastive learning has been widely used to train transformer-based vis...
research
07/16/2022

TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval

Text-Video retrieval is a task of great practical value and has received...
research
01/26/2023

Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring

Image-text pretrained models, e.g., CLIP, have shown impressive general ...
research
05/29/2022

CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

Large-scale pretrained transformers have created milestones in text (GPT...

Please sign up or login with your details

Forgot password? Click here to reset