Learning Trajectory-Word Alignments for Video-Language Tasks

01/05/2023
by Xu Yang, et al.

Aligning objects with words plays a critical role in Image-Language BERT (IL-BERT) and Video-Language BERT (VDL-BERT). Unlike the image case, where an object covers a few spatial patches, an object in a video usually appears as an object trajectory: it spans a few spatial patches but many temporal ones, and thus carries abundant spatiotemporal context. However, modern VDL-BERTs neglect this trajectory characteristic. They usually follow IL-BERTs and deploy patch-to-word (P2W) attention, which may over-exploit trivial spatial contexts and neglect significant temporal contexts. To amend this, we propose a novel TW-BERT that learns Trajectory-Word alignment for solving video-language tasks. This alignment is learned by a newly designed trajectory-to-word (T2W) attention. Besides T2W attention, we follow previous VDL-BERTs and set a word-to-patch (W2P) attention in the cross-modal encoder. Since T2W and W2P attentions have different structures, our cross-modal encoder is asymmetric. To further help this asymmetric cross-modal encoder build robust vision-language associations, we propose a fine-grained “align-before-fuse” strategy that pulls together the embedding spaces computed by the video and text encoders. With the proposed strategy and T2W attention, TW-BERT achieves state-of-the-art performance on text-to-video retrieval tasks and, on video question answering tasks, performance comparable to some VDL-BERTs trained on much more data. The code will be available in the supplementary material.
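To make the “align-before-fuse” idea concrete, here is a minimal sketch of one common way to pull video and text embedding spaces together before cross-modal fusion. The abstract does not specify the objective, so everything here is an assumption: the sketch uses a standard symmetric contrastive (InfoNCE-style) loss on one pooled embedding per clip and per caption, which is a coarse-grained stand-in and not the paper's fine-grained strategy. The function names, the temperature value, and the toy data are all hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Assumed alignment objective (not taken from the paper).

    video_emb, text_emb: arrays of shape (batch, dim); row i of each
    is treated as a matching video-text pair, all other rows as negatives.
    """
    v = l2_normalize(video_emb)
    t = l2_normalize(text_emb)
    sim = v @ t.T / temperature           # (batch, batch) similarity matrix
    idx = np.arange(sim.shape[0])
    # Video-to-text: each row should put its mass on the diagonal entry.
    loss_v2t = -log_softmax(sim, axis=1)[idx, idx].mean()
    # Text-to-video: each column should put its mass on the diagonal entry.
    loss_t2v = -log_softmax(sim, axis=0)[idx, idx].mean()
    return 0.5 * (loss_v2t + loss_t2v)

# Toy data: "text" embeddings are noisy copies of the "video" embeddings.
rng = np.random.default_rng(0)
video = rng.normal(size=(4, 16))
text_aligned = video + 0.01 * rng.normal(size=(4, 16))
text_shuffled = text_aligned[::-1]        # every pair is now mismatched

loss_aligned = symmetric_contrastive_loss(video, text_aligned)
loss_shuffled = symmetric_contrastive_loss(video, text_shuffled)
print(loss_aligned < loss_shuffled)
```

The loss is lower when matching pairs are close in the shared space, which is the property an alignment stage relies on before the asymmetric cross-modal encoder fuses the two modalities.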


