CoCo-BERT: Improving Video-Language Pre-training with Contrastive Cross-modal Matching and Denoising

12/14/2021
by Jianjie Luo, et al.

The BERT-type structure has revolutionized vision-language pre-training and achieved state-of-the-art results on numerous vision-language downstream tasks. Existing solutions predominantly capitalize on multi-modal inputs with mask tokens to trigger mask-based proxy pre-training tasks (e.g., masked language modeling and masked object/frame prediction). In this work, we argue that such masked inputs inevitably introduce noise into the cross-modal matching proxy task, leaving the inherent vision-language association under-explored. As an alternative, we derive a particular form of cross-modal proxy objective for video-language pre-training: Contrastive Cross-modal matching and denoising (CoCo). By viewing the masked frame/word sequences as noisy augmentations of the primary unmasked ones, CoCo strengthens video-language association by simultaneously pursuing inter-modal matching and intra-modal denoising between masked and unmasked inputs in a contrastive manner. The CoCo proxy objective can be integrated into any BERT-type encoder-decoder structure for video-language pre-training, yielding Contrastive Cross-modal BERT (CoCo-BERT). We pre-train CoCo-BERT on the TV dataset and a newly collected large-scale GIF video dataset (ACTION). Through extensive experiments over a wide range of downstream tasks (e.g., cross-modal retrieval, video question answering, and video captioning), we demonstrate the superiority of CoCo-BERT as a pre-trained structure.
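The abstract describes a contrastive objective that combines inter-modal matching (video vs. text) with intra-modal denoising (masked vs. unmasked views of the same modality). The paper does not give code here, but the general shape of such an objective can be sketched with a standard InfoNCE-style loss. The function names, the use of plain feature matrices, and the equal weighting of the two terms below are illustrative assumptions, not the authors' exact formulation:

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.07):
    # Standard InfoNCE: anchor i should match positive i against all
    # other positives in the batch, via a softmax over cosine similarities.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature                 # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))              # diagonal = matched pairs

def coco_style_objective(video_emb, text_emb,
                         video_emb_masked, text_emb_masked):
    # Hypothetical sketch of the CoCo idea:
    #   inter-modal matching on unmasked video/text features, plus
    #   intra-modal denoising pulling each masked (noisy) view toward
    #   its unmasked counterpart, all in a contrastive manner.
    inter = info_nce(video_emb, text_emb) + info_nce(text_emb, video_emb)
    denoise = (info_nce(video_emb_masked, video_emb)
               + info_nce(text_emb_masked, text_emb))
    return inter + denoise
```

For example, with a batch of 8 paired video/text embeddings and their masked counterparts (all of shape `(8, 256)`), `coco_style_objective` returns a single scalar to minimize; gradients would then flow into the encoders in an actual training setup.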

Related research

10/23/2020 · ST-BERT: Cross-modal Language Model Pre-training For End-to-end Spoken Language Understanding
Language model pre-training has shown promising results in various downs...

11/26/2020 · A Recurrent Vision-and-Language BERT for Navigation
Accuracy of many visiolinguistic tasks has benefited significantly from ...

07/02/2022 · Contrastive Cross-Modal Knowledge Sharing Pre-training for Vision-Language Representation Learning and Retrieval
Recently, the cross-modal pre-training task has been a hotspot because o...

08/26/2023 · Video and Audio are Images: A Cross-Modal Mixer for Original Data on Video-Audio Retrieval
Cross-modal retrieval has become popular in recent years, particularly w...

05/18/2021 · Weakly Supervised Dense Video Captioning via Jointly Usage of Knowledge Distillation and Cross-modal Matching
This paper proposes an approach to Dense Video Captioning (DVC) without ...

01/05/2023 · Learning Trajectory-Word Alignments for Video-Language Tasks
Aligning objects with words plays a critical role in Image-Language BERT...

09/05/2021 · Data Efficient Masked Language Modeling for Vision and Language
Masked language modeling (MLM) is one of the key sub-tasks in vision-lan...
