Too Large; Data Reduction for Vision-Language Pre-Training

05/31/2023
by Alex Jinpeng Wang, et al.

This paper examines two problems in widely used large-scale Vision-Language Pre-training (VLP) datasets: severe image-text misalignment and high redundancy. To address these issues, we propose TL;DR, an efficient and straightforward vision-language learning algorithm that compresses an existing large VLP dataset into a small, high-quality set. Our approach consists of two major steps. First, a codebook-based encoder-decoder captioner is developed to select representative samples. Second, a new caption is generated to complement the original caption of each selected sample, mitigating the text-image misalignment while preserving uniqueness. As a result, TL;DR reduces a large dataset to a small set of high-quality data that can serve as an alternative pre-training dataset, significantly speeding up the time-consuming pre-training process. Specifically, TL;DR compresses mainstream VLP datasets at a high ratio, e.g., reducing the well-cleaned CC3M dataset from 2.82M to 0.67M samples (∼24%) and the noisy YFCC15M from 15M to 2.5M (∼16.7%). Extensive experiments with three popular VLP models on seven downstream tasks show that a VLP model trained on the compressed dataset produced by TL;DR performs on par with, or even better than, one trained on the full-scale dataset. The code will be made available at <https://github.com/showlab/data-centric.vlp>.
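The first step above, selecting representative samples, can be sketched in miniature. The snippet below is a hypothetical illustration, not the paper's method: it stands in for the learned codebook with a small k-means over (simulated) image-text embeddings, then keeps the one sample nearest each centroid, so the kept fraction is controlled by the number of codes `k`. All function names and the random embeddings are assumptions for illustration.

```python
# Hypothetical sketch of codebook-style sample selection: cluster
# embeddings and keep the sample closest to each cluster center.
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    """Plain k-means as a stand-in for the paper's learned codebook."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    labels = np.zeros(len(x), dtype=int)
    for _ in range(iters):
        # Assign each embedding to its nearest centroid.
        dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points.
        for j in range(k):
            pts = x[labels == j]
            if len(pts):
                centroids[j] = pts.mean(axis=0)
    return centroids, labels

def select_representatives(embeddings, k):
    """Return indices of the sample nearest each centroid (<= k kept)."""
    centroids, labels = kmeans(embeddings, k)
    keep = []
    for j in range(k):
        idx = np.flatnonzero(labels == j)
        if len(idx) == 0:
            continue  # empty clusters contribute no sample
        d = np.linalg.norm(embeddings[idx] - centroids[j], axis=1)
        keep.append(int(idx[d.argmin()]))
    return sorted(keep)

# Toy data: 1000 fake joint embeddings; keep ~24%, echoing the CC3M ratio.
rng = np.random.default_rng(1)
emb = rng.normal(size=(1000, 32))
kept = select_representatives(emb, k=240)
print(f"kept {len(kept)} of {len(emb)} samples")
```

In the actual paper a trained encoder-decoder captioner provides both the codebook and the new complementary captions; the clustering here only conveys the intuition that one representative per code suffices.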

