Data Efficient Masked Language Modeling for Vision and Language

09/05/2021
by Yonatan Bitton, et al.

Masked language modeling (MLM) is one of the key sub-tasks in vision-language pre-training. In the cross-modal setting, tokens in the sentence are masked at random, and the model predicts the masked tokens given the image and the text. In this paper, we observe several key disadvantages of MLM in this setting. First, as captions tend to be short, in a third of the sentences no token is sampled. Second, the majority of masked tokens are stop-words and punctuation, leading to under-utilization of the image. We investigate a range of alternative masking strategies specific to the cross-modal setting that address these shortcomings, aiming for better fusion of text and image in the learned representation. When pre-training the LXMERT model, our alternative masking strategies consistently improve over the original masking strategy on three downstream tasks, especially in low-resource settings. Further, our pre-training approach substantially outperforms the baseline model on a prompt-based probing task designed to elicit image objects. These results and our analysis indicate that our method allows for better utilization of the training data.
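
To make the shortcomings concrete, the sketch below contrasts standard BERT-style random masking (each token masked independently with probability 0.15) with a simple content-aware alternative that restricts mask candidates to non-stop-word tokens and guarantees at least one masked token per caption. This is an illustrative assumption, not the paper's exact masking strategies; the STOP_WORDS set, the function names, and the example caption are hypothetical.

```python
import random

# Hypothetical illustration, not the authors' implementation: with standard
# 15% random masking, a short caption often ends up with zero masked tokens,
# and most masks land on stop-words or punctuation. A content-aware variant
# masks only content words and always masks at least one token per caption.

STOP_WORDS = {"a", "an", "the", "is", "are", "on", "in", "of", "with", "."}

def random_masking(tokens, p=0.15, mask_token="[MASK]"):
    """Standard MLM masking: each token is masked independently with prob p."""
    return [mask_token if random.random() < p else t for t in tokens]

def content_word_masking(tokens, p=0.15, mask_token="[MASK]"):
    """Sketch of an alternative: mask only content words, at least one per caption."""
    candidates = [i for i, t in enumerate(tokens) if t.lower() not in STOP_WORDS]
    if not candidates:
        return list(tokens)
    k = max(1, round(p * len(candidates)))  # guarantee >= 1 masked token
    chosen = set(random.sample(candidates, k))
    return [mask_token if i in chosen else t for i, t in enumerate(tokens)]

caption = "a dog is playing with a ball on the grass".split()
print(random_masking(caption))        # may mask nothing, or only stop-words
print(content_word_masking(caption))  # always masks at least one content word
```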

Related research

10/23/2020
ST-BERT: Cross-modal Language Model Pre-training For End-to-end Spoken Language Understanding
Language model pre-training has shown promising results in various downs...

09/10/2021
EfficientCLIP: Efficient Cross-Modal Pre-training by Ensemble Confident Learning and Language Modeling
While large scale pre-training has achieved great achievements in bridgi...

03/14/2022
Leveraging Visual Knowledge in Language Tasks: An Empirical Study on Intermediate Pre-training for Cross-modal Knowledge Transfer
Pre-trained language models are still far from human performance in task...

06/17/2022
VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix
Existing vision-language pre-training (VLP) methods primarily rely on pa...

05/27/2023
PuMer: Pruning and Merging Tokens for Efficient Vision Language Models
Large-scale vision language (VL) models use Transformers to perform cros...

12/14/2021
CoCo-BERT: Improving Video-Language Pre-training with Contrastive Cross-modal Matching and Denoising
BERT-type structure has led to the revolution of vision-language pre-tra...

06/25/2021
Learning to Sample Replacements for ELECTRA Pre-Training
ELECTRA pretrains a discriminator to detect replaced tokens, where the r...
