Joint Adaptive Representations for Image-Language Learning

05/31/2023
by   AJ Piergiovanni, et al.
0

Image-language learning has made unprecedented progress in visual understanding. These developments have come at high costs, as contemporary vision-language models require large model scales and amounts of data. We here propose a much easier recipe for image-language learning, which produces effective models, outperforming bigger and more expensive ones, often trained on orders of magnitude larger datasets. Our key finding is the joint learning of a compact vision and language representation, which adaptively and iteratively fuses the multi-modal features. This results in a more effective image-language learning, greatly lowering the FLOPs by combining and reducing the number of tokens for both text and images, e.g. a 33% reduction in FLOPs is achieved, compared to baseline fusion techniques used by popular image-language models, while improving performance. This also allows the model to scale without a large increase in FLOPs or memory. In addition, we propose adaptive pre-training data sampling which improves the data efficiency. The proposed approach achieves competitive performance compared to much larger models, and does so with significantly less data and FLOPs. With only 40M training examples and with 39 GFLOPs our lightweight model outperforms many times larger state-of-the-art models of 2-20x more FLOPs and using bigger datasets some of which with close to 1B training examples.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
01/30/2023

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

The cost of vision-and-language pre-training has become increasingly pro...
research
09/14/2023

MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning

Starting from the resurgence of deep learning, vision-language models (V...
research
06/15/2022

Prefix Language Models are Unified Modal Learners

With the success of vision-language pre-training, we have witnessed the ...
research
06/01/2023

Adapting Pre-trained Language Models to Vision-Language Tasks via Dynamic Visual Prompting

Pre-trained language models (PLMs) have played an increasing role in mul...
research
05/24/2023

Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models

Recently, growing interest has been aroused in extending the multimodal ...
research
05/03/2023

Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes

Deploying large language models (LLMs) is challenging because they are m...
research
05/22/2022

Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models

Despite their wide adoption, the underlying training and memorization dy...

Please sign up or login with your details

Forgot password? Click here to reset