Clover: Towards A Unified Video-Language Alignment and Fusion Model

07/16/2022
by Jingjia Huang et al.

Building a universal video-language model that can solve a variety of video understanding tasks (e.g., text-video retrieval, video question answering) is an open challenge for the machine learning field. Towards this goal, most recent attempts train models, usually consisting of uni-modal and cross-modal feature encoders, with supervised or pair-wise contrastive pretext tasks. Though offering attractive generality, the resulting models have to compromise between efficiency and performance. We argue that these flaws are caused by their pre-training strategies: they cannot simultaneously align and fuse features from different modalities well. We then introduce Clover, a Correlated Video-Language pre-training method, which works towards a universal video-language model that solves multiple video understanding tasks without compromising either performance or efficiency. Clover improves cross-modal feature alignment and fusion via a novel tri-modal alignment pre-training task. We further enhance the tri-modal alignment by incorporating learning from masked samples and a novel pair-wise ranking loss. Clover demonstrates outstanding generality: it establishes new state-of-the-art results on multiple downstream tasks, including three retrieval tasks under both zero-shot and fine-tuning settings, and eight video question answering tasks. Code and pre-trained models will be released at https://github.com/LeeYN-43/Clover.
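
The abstract does not spell out the objective in detail, so the following is only a rough illustration: a minimal PyTorch-style sketch of what a tri-modal alignment loss with a pair-wise ranking term could look like. Everything here is an assumption made for illustration. The helper names (`info_nce`, `tri_modal_alignment_loss`), the choice of video, text, and fused video-text embeddings as the three aligned representations, and the margin-based ranking formulation are not taken from the paper, and the sketch omits the masked-sample learning the authors also describe.

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of L2-normalized embeddings."""
    logits = a @ b.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def tri_modal_alignment_loss(video_emb, text_emb, fused_emb, margin=0.2):
    """Hypothetical tri-modal alignment: video, text, and fused video-text
    embeddings are pulled together pair-wise, plus a margin ranking term.
    This is an illustrative guess, not Clover's actual objective."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    f = F.normalize(fused_emb, dim=-1)

    # Contrastive alignment over all three pairs of representations.
    align = info_nce(v, t) + info_nce(v, f) + info_nce(t, f)

    # Pair-wise ranking: a matched (video, fused) pair should beat the
    # hardest in-batch negative by at least `margin`.
    sims = v @ f.t()                           # (B, B) video-to-fused similarities
    pos = sims.diagonal()                      # matched pairs
    neg = sims.clone().fill_diagonal_(float('-inf')).max(dim=1).values
    rank = F.relu(margin - pos + neg).mean()

    return align + rank
```

A call like `tri_modal_alignment_loss(v, t, f)` on three (batch, dim) tensors returns a scalar loss; in an architecture like the one the abstract describes, the fused embedding would come from the cross-modal encoder while the other two come from the uni-modal encoders.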

Related research:

Align and Prompt: Video-and-Language Pre-training with Entity Prompts (12/17/2021)
Video-and-language pre-training has shown promising improvements on vari...

EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone (07/11/2023)
Video-language pre-training (VLP) has become increasingly important due ...

Learning Trajectory-Word Alignments for Video-Language Tasks (01/05/2023)
Aligning objects with words plays a critical role in Image-Language BERT...

VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding (05/20/2021)
We present a simplified, task-agnostic multi-modal pre-training approach...

X-modaler: A Versatile and High-performance Codebase for Cross-modal Analytics (08/18/2021)
With the rise and development of deep learning over the past decade, the...

Learning to Locate Visual Answer in Video Corpus Using Question (10/11/2022)
We introduce a new task, named video corpus visual answer localization (...

Verbs in Action: Improving verb understanding in video-language models (04/13/2023)
Understanding verbs is crucial to modelling how people and objects inter...
