BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

01/30/2023
by Junnan Li, et al.

The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods. For example, our model outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. We also demonstrate the model's emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.
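The core idea described above — a small set of learnable queries that cross-attend to frozen image features and are then projected into a frozen LLM's input space — can be sketched in a few lines of PyTorch. This is a minimal illustrative sketch, not the paper's Q-Former implementation (which is BERT-based with additional self-attention and text-interaction objectives); all dimensions and class names here are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

class QFormerSketch(nn.Module):
    """Illustrative sketch of BLIP-2's Querying Transformer idea:
    learnable query embeddings cross-attend to frozen image features,
    and the query outputs become soft visual prompts for a frozen LLM.
    Dimensions are hypothetical, not the paper's."""

    def __init__(self, num_queries=32, dim=768, llm_dim=2560, num_heads=8):
        super().__init__()
        # Learnable queries: the only image-side trainable state besides the layers below.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim)
        )
        # Stage-2 bridge: project query outputs into the frozen LLM's embedding space.
        self.to_llm = nn.Linear(dim, llm_dim)

    def forward(self, image_feats):
        # image_feats: (batch, num_patches, dim) from a frozen image encoder.
        b = image_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        q = q + self.cross_attn(q, image_feats, image_feats)[0]  # queries attend to image
        q = q + self.ffn(q)
        return self.to_llm(q)  # (batch, num_queries, llm_dim)

# Random features stand in for a frozen image encoder's patch tokens here.
frozen_feats = torch.randn(2, 257, 768)  # e.g. a ViT's 256 patches + [CLS]
qformer = QFormerSketch()
prompts = qformer(frozen_feats)
print(prompts.shape)  # torch.Size([2, 32, 2560])
```

In training, only the Q-Former (and the projection) receives gradients; the image encoder and LLM stay frozen, which is what keeps the trainable parameter count small relative to end-to-end models like Flamingo.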
