Align and Prompt: Video-and-Language Pre-training with Entity Prompts

12/17/2021
by   Dongxu Li, et al.

Video-and-language pre-training has shown promising improvements on various downstream tasks. Most previous methods capture cross-modal interactions with a transformer-based multimodal encoder, without fully addressing the misalignment between unimodal video and text features. Moreover, learning fine-grained visual-language alignment usually requires off-the-shelf object detectors to provide object information, which is bottlenecked by the detector's limited vocabulary and high computational cost. We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment. First, we introduce a video-text contrastive (VTC) loss to align unimodal video-text features at the instance level, which eases the modeling of cross-modal interactions. Then, we propose a new visually-grounded pre-training task, prompting entity modeling (PEM), which aims to learn fine-grained region-entity alignment. To achieve this, we first introduce an entity prompter module, which is trained with VTC to produce the similarity between a video crop and text prompts instantiated with entity names. The PEM task then asks the model to predict the entity pseudo-labels (i.e., normalized similarity scores) for randomly-selected video crops. The resulting pre-trained model achieves state-of-the-art performance on both text-video retrieval and videoQA, outperforming prior work by a substantial margin. Our code and pre-trained models are available at https://github.com/salesforce/ALPRO.
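The instance-level alignment described above can be illustrated with a symmetric InfoNCE-style contrastive loss over a batch of unimodal video and text embeddings. The sketch below is a generic illustration of such a video-text contrastive objective (ALPRO's exact formulation, temperature, and momentum-queue details may differ; `temperature=0.07` is an assumed value, and the function name is hypothetical):

```python
import numpy as np

def video_text_contrastive_loss(video_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE-style contrastive loss: matched video-text pairs
    (the diagonal of the similarity matrix) are pulled together, while
    mismatched pairs in the batch are pushed apart.

    video_feats, text_feats: arrays of shape (batch, dim), row i of each
    array coming from the same video-text pair.
    """
    # L2-normalize so the dot product becomes cosine similarity.
    v = video_feats / np.linalg.norm(video_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = v @ t.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(len(v))                 # matched pairs lie on the diagonal

    def cross_entropy_diag(lg):
        lg = lg - lg.max(axis=1, keepdims=True)   # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[idx, idx].mean()

    # Average the video-to-text and text-to-video directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))
```

A useful sanity check on the design: when every video embedding equals its paired text embedding, the diagonal dominates and the loss approaches zero; with random, unrelated embeddings the loss stays near `log(batch_size)`.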

