LAMP: Label Augmented Multimodal Pretraining

12/08/2020
by Jia Guo, et al.

Multi-modal representation learning by pretraining has attracted increasing interest due to its ease of use and its potential benefit for a variety of Visual-and-Language (V-L) tasks. However, its requirement for a large volume of high-quality vision-language pairs severely limits its value in practice. In this paper, we propose a novel label-augmented V-L pretraining model, named LAMP, to address this problem. Specifically, we leverage auto-generated labels of visual objects to enrich vision-language pairs with fine-grained alignment, and we design a corresponding novel pretraining task. We also find that such label augmentation in second-stage pretraining further benefits a wide range of downstream tasks. To evaluate LAMP, we compare it with state-of-the-art models on four downstream tasks. The quantitative results and analysis demonstrate the value of labels in V-L pretraining and the effectiveness of LAMP.
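To make the label-augmentation idea concrete, the sketch below shows one plausible way detector-generated object labels could be appended to a caption and partially masked to form targets for a label-alignment pretraining objective. This is a minimal illustration written for this summary; the function names, special tokens, and masking scheme are assumptions and are not taken from the LAMP paper or any released code.

```python
import random

MASK_TOKEN = "[MASK]"

def build_label_augmented_input(caption_tokens, object_labels,
                                mask_prob=0.15, seed=None):
    """Concatenate caption tokens with auto-generated object labels and
    randomly mask some labels, yielding (input tokens, masked-label targets)
    for a hypothetical masked-label prediction task."""
    rng = random.Random(seed)
    tokens = ["[CLS]"] + list(caption_tokens) + ["[SEP]"]
    label_targets = []  # (position in sequence, original label) pairs to predict
    for label in object_labels:
        if rng.random() < mask_prob:
            label_targets.append((len(tokens), label))
            tokens.append(MASK_TOKEN)
        else:
            tokens.append(label)
    tokens.append("[SEP]")
    return tokens, label_targets

if __name__ == "__main__":
    caption = "a dog catching a frisbee in the park".split()
    labels = ["dog", "frisbee", "grass", "person"]  # e.g. from an object detector
    seq, targets = build_label_augmented_input(caption, labels,
                                               mask_prob=0.5, seed=0)
    print(seq)
    print(targets)
```

In this toy setup, the masked label positions give the model a fine-grained, object-level supervision signal on top of the usual image-caption pair, which is the general effect the abstract attributes to label augmentation.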


