Weakly-Supervised Learning of Visual Relations in Multimodal Pretraining

05/23/2023
by Emanuele Bugliarello, et al.

Recent work in vision-and-language pretraining has investigated supervised signals from object detection data to learn better, fine-grained multimodal representations. In this work, we take a step further and explore how to add supervision from small-scale visual relation data. In particular, we propose two pretraining approaches to contextualise visual entities in a multimodal setup. With verbalised scene graphs, we transform visual relation triplets into structured captions and treat them as additional views of images. With masked relation prediction, we further encourage relating entities from visually masked contexts. When applied to strong baselines pretrained on large amounts of Web data, zero-shot evaluations on both coarse-grained and fine-grained tasks show the efficacy of our methods in learning multimodal representations from weakly-supervised relation data.
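To make the verbalised-scene-graph idea concrete, here is a minimal sketch of turning visual relation triplets into a caption-like string. The template and function names are illustrative assumptions, not the authors' actual verbalisation scheme:

```python
def verbalise_triplet(subject, predicate, obj):
    """Render one (subject, predicate, object) visual relation triplet
    as a short sentence. The "subject predicate object." template is a
    hypothetical example, not the paper's exact format."""
    return f"{subject} {predicate} {obj}."

def verbalise_scene_graph(triplets):
    """Join the verbalised triplets into one structured caption that
    could serve as an additional textual view of the image."""
    return " ".join(verbalise_triplet(s, p, o) for s, p, o in triplets)

caption = verbalise_scene_graph([
    ("a man", "is riding", "a horse"),
    ("the horse", "is standing on", "grass"),
])
print(caption)  # a man is riding a horse. the horse is standing on grass.
```

Such captions pair naturally with the original image during pretraining, giving the model an extra, relation-focused textual view alongside the Web caption.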

Related research

10/09/2022 · MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning
Multimodal representation learning has shown promising improvements on v...

05/12/2023 · Measuring Progress in Fine-grained Vision-and-Language Understanding
While pretraining on large-scale image-text data from the Web has facili...

12/20/2019 · Pretrained Encyclopedia: Weakly Supervised Knowledge-Pretrained Language Model
Recent breakthroughs of pretrained language models have shown the effect...

05/20/2023 · What Makes for Good Visual Tokenizers for Large Language Models?
We empirically investigate proper pre-training methods to build good vis...

10/04/2013 · Weakly supervised clustering: Learning fine-grained signals from coarse labels
Consider a classification problem where we do not have access to labels ...

03/17/2023 · Enhancing the Role of Context in Region-Word Alignment for Object Detection
Vision-language pretraining to learn a fine-grained, region-word alignme...

10/21/2022 · Do Vision-and-Language Transformers Learn Grounded Predicate-Noun Dependencies?
Recent advances in vision-and-language modeling have seen the developmen...
