Weakly-supervised VisualBERT: Pre-training without Parallel Images and Captions

10/24/2020
by   Liunian Harold Li, et al.
0

Pre-trained contextual vision-and-language (V L) models have brought impressive performance improvement on various benchmarks. However, the paired text-image data required for pre-training are hard to collect and scale up. We investigate if a strong V L representation model can be learned without text-image pairs. We propose Weakly-supervised VisualBERT with the key idea of conducting "mask-and-predict" pre-training on language-only and image-only corpora. Additionally, we introduce the object tags detected by an object recognition model as anchor points to bridge two modalities. Evaluation on four V L benchmarks shows that Weakly-supervised VisualBERT achieves similar performance with a model pre-trained with paired data. Besides, pre-training on more image-only data further improves a model that already has access to aligned data, suggesting the possibility of utilizing billions of raw images available to enhance V L models.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
07/14/2023

PiTL: Cross-modal Retrieval with Weakly-supervised Vision-language Pre-training via Prompting

Vision-language (VL) Pre-training (VLP) has shown to well generalize VL ...
research
03/30/2021

Self-supervised Image-text Pre-training With Mixed Data In Chest X-rays

Pre-trained models, e.g., from ImageNet, have proven to be effective in ...
research
04/13/2020

Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks

Large-scale pre-training methods of learning cross-modal representations...
research
10/24/2022

Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision

Weakly-supervised vision-language (V-L) pre-training (W-VLP) aims at lea...
research
09/21/2023

Weakly-supervised Automated Audio Captioning via text only training

In recent years, datasets of paired audio and captions have enabled rema...
research
06/09/2023

WSPAlign: Word Alignment Pre-training via Large-Scale Weakly Supervised Span Prediction

Most existing word alignment methods rely on manual alignment datasets o...
research
03/27/2020

Weakly Supervised Dataset Collection for Robust Person Detection

To construct an algorithm that can provide robust person detection, we p...

Please sign up or login with your details

Forgot password? Click here to reset