Bootstrapping Vision-Language Learning with Decoupled Language Pre-training

07/13/2023
by Yiren Jian, et al.

We present a novel methodology aimed at optimizing the application of frozen large language models (LLMs) for resource-intensive vision-language (VL) pre-training. The current paradigm uses visual features as prompts to guide language models, with a focus on determining the most relevant visual features for corresponding text. Our approach diverges by concentrating on the language component, specifically identifying the optimal prompts to align with visual features. We introduce the Prompt-Transformer (P-Former), a model that predicts these ideal prompts and is trained exclusively on linguistic data, bypassing the need for image-text pairings. This strategy in effect splits the end-to-end VL training process by introducing an additional, separate language-only stage. Our experiments reveal that our framework significantly enhances the performance of a strong image-to-text baseline (BLIP-2) and narrows the performance gap between models trained on 4M versus 129M image-text pairs. Importantly, our framework is modality-agnostic and flexible in architectural design, as validated by its successful application to a video learning task using varied base modules. The code is available at https://github.com/yiren-jian/BLIText.
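To make the two-stage reading of the abstract concrete, below is a minimal PyTorch sketch of how such a decoupled setup could look: stage 1 trains a P-Former on text alone to predict soft prompts that let a frozen LLM regenerate the caption, and stage 2 aligns a vision module's outputs with those pre-trained prompt targets. This is an illustration of the idea only; the module sizes, the toy stand-in for the frozen LLM, the shared embedding table, and the MSE alignment objective are assumptions made for the sketch, not the paper's actual design, which lives in the linked repository.

```python
# A minimal sketch of decoupled prompt pre-training, under the
# assumptions stated above. Not the authors' implementation; see
# https://github.com/yiren-jian/BLIText for the real code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PFormer(nn.Module):
    """Maps a caption to a fixed set of soft prompt vectors."""

    def __init__(self, vocab_size=30522, d_model=256, num_prompts=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Learnable queries that read out num_prompts prompt vectors.
        self.queries = nn.Parameter(torch.randn(num_prompts, d_model))

    def forward(self, token_ids):                      # token_ids: (B, T)
        h = self.encoder(self.embed(token_ids))        # (B, T, D)
        # Single-head cross-attention readout: queries attend over text.
        attn = torch.softmax(
            self.queries @ h.transpose(1, 2) / h.size(-1) ** 0.5, dim=-1)
        return attn @ h                                # (B, num_prompts, D)


def stage1_text_only_loss(pformer, frozen_llm, lm_head, token_ids):
    """Stage 1: prompts predicted from a caption must let the frozen LLM
    regenerate that caption; no images are involved, so this runs on
    text-only corpora. The LLM's weights are frozen outside this
    function, but gradients still flow back to the P-Former."""
    prompts = pformer(token_ids)                       # (B, P, D)
    # For brevity the P-Former's embedding table doubles as the LLM's
    # input embedding; a real LLM would use its own.
    inputs = torch.cat([prompts, pformer.embed(token_ids)], dim=1)
    causal = nn.Transformer.generate_square_subsequent_mask(inputs.size(1))
    hidden = frozen_llm(inputs, mask=causal)           # (B, P+T, D)
    num_p = prompts.size(1)
    # Position P-1 predicts token 0, ..., position P+T-2 predicts token T-1.
    logits = lm_head(hidden[:, num_p - 1:-1, :])       # (B, T, V)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           token_ids.reshape(-1))


def stage2_alignment_loss(vision_prompts, pformer, token_ids):
    """Stage 2: the vision-to-prompt module (a Q-Former in BLIP-2) is
    trained so its output matches the prompts the now-frozen P-Former
    predicts for the paired caption. MSE is an assumed stand-in for the
    paper's actual alignment objective."""
    with torch.no_grad():
        target = pformer(token_ids)
    return F.mse_loss(vision_prompts, target)


if __name__ == "__main__":
    torch.manual_seed(0)
    d, vocab = 256, 30522
    pformer = PFormer(vocab_size=vocab, d_model=d)
    # Tiny stand-in for a frozen decoder-only LLM.
    llm_layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
    frozen_llm = nn.TransformerEncoder(llm_layer, num_layers=2)
    for p in frozen_llm.parameters():
        p.requires_grad_(False)
    lm_head = nn.Linear(d, vocab)

    captions = torch.randint(0, vocab, (2, 16))        # fake text batch
    print("stage-1 loss:",
          stage1_text_only_loss(pformer, frozen_llm, lm_head, captions).item())

    vision_prompts = torch.randn(2, 32, d)  # would come from a Q-Former
    print("stage-2 loss:",
          stage2_alignment_loss(vision_prompts, pformer, captions).item())
```

The key property this sketch tries to capture is that stage 1 never touches an image, so the prompt predictor can be pre-trained on abundant language-only data; stage 2 then reduces expensive vision-language alignment to matching the vision module's outputs against fixed, pre-computed targets.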


Related research

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (01/30/2023)
DeltaEdit: Exploring Text-free Training for Text-Driven Image Manipulation (03/11/2023)
SEAL: Interactive Tool for Systematic Error Analysis and Labeling (10/11/2022)
Shuffle-Then-Assemble: Learning Object-Agnostic Visual Relationship Features (08/01/2018)
T-MARS: Improving Visual Representations by Circumventing Text Feature Learning (07/06/2023)
Improving CLIP Training with Language Rewrites (05/31/2023)
Parts of Speech-Grounded Subspaces in Vision-Language Models (05/23/2023)
