MixGen: A New Multi-Modal Data Augmentation

06/16/2022
by   Xiaoshuai Hao, et al.

Data augmentation is necessary to improve data efficiency in deep learning. For vision-language pre-training, previous works augment data only for images or only for text. In this paper, we present MixGen: a joint data augmentation for vision-language representation learning to further improve data efficiency. It generates new image-text pairs with semantic relationships preserved by interpolating images and concatenating text. It is simple and can be plugged into existing pipelines. We evaluate MixGen on four architectures, including CLIP, ViLT, ALBEF and TCL, across five downstream vision-language tasks to show its versatility and effectiveness. For example, adding MixGen in ALBEF pre-training leads to absolute performance improvements on downstream tasks: image-text retrieval (+6.2 on Flickr30K zero-shot), visual grounding (+0.9), visual reasoning (+0.9), and visual entailment (+0.4).
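
The abstract describes the augmentation only as interpolating images and concatenating text. A minimal sketch of that idea in Python with NumPy follows. The function name `mixgen`, the mixing weight `lam`, the parameter `num_mixed`, and the rule for pairing sample `i` with sample `i + num_mixed` are assumptions made for illustration, not details confirmed by the text above.

```python
import numpy as np


def mixgen(images, texts, num_mixed, lam=0.5):
    """Illustrative joint image-text augmentation (hypothetical helper).

    images: array of shape (batch, H, W, C)
    texts: list of `batch` caption strings
    num_mixed: how many leading samples are replaced by mixed pairs
    lam: interpolation weight for the first image of each pair (assumed)
    """
    if 2 * num_mixed > len(texts):
        raise ValueError("need at least 2 * num_mixed samples in the batch")

    # Work on float copies so the caller's batch is left untouched.
    mixed_images = np.array(images, dtype=np.float32)
    mixed_texts = list(texts)

    for i in range(num_mixed):
        j = i + num_mixed  # assumed pairing rule: partner from later in the batch
        # Interpolate the two images pixel by pixel.
        mixed_images[i] = lam * mixed_images[i] + (1.0 - lam) * mixed_images[j]
        # Concatenate the two captions so both descriptions are kept.
        mixed_texts[i] = texts[i] + " " + texts[j]

    return mixed_images, mixed_texts


if __name__ == "__main__":
    batch = np.stack([np.zeros((2, 2, 3)), np.ones((2, 2, 3))])
    captions = ["a dog on grass", "a cat on a sofa"]
    new_images, new_captions = mixgen(batch, captions, num_mixed=1)
    print(new_captions[0])
```

Because the mixed image contains both scenes and the mixed caption contains both descriptions, the new pair stays semantically matched, which is the property the abstract emphasizes.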


