Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment

02/02/2023
by Hao Liu, et al.

Recent progress in scaling up large language models has shown impressive capabilities in few-shot learning across a wide range of text-based tasks. However, a key limitation is that these language models fundamentally lack visual perception, a crucial attribute needed to extend them to interact with the real world and solve vision tasks such as visual question answering and robotics. Prior works have largely connected images to text through pretraining and/or fine-tuning on curated image-text datasets, which can be a costly process. To resolve this limitation, we propose a simple yet effective approach called Language-Quantized AutoEncoder (LQAE), a modification of VQ-VAE that learns to align text-image data in an unsupervised manner by leveraging pretrained language models (e.g., BERT, RoBERTa). Our main idea is to encode images as sequences of text tokens by directly quantizing image embeddings using a pretrained language codebook. We then apply random masking followed by a BERT model, and have the decoder reconstruct the original image from the BERT-predicted text token embeddings. In doing so, LQAE learns to represent similar images with similar clusters of text tokens, aligning the two modalities without the use of aligned text-image pairs. This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features. To the best of our knowledge, ours is the first work to use unaligned images for multimodal tasks by leveraging the power of pretrained language models.
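The core quantization step described above can be sketched roughly as follows: each image patch embedding is replaced by its nearest neighbor in a frozen language-model embedding table, yielding a sequence of "text" token ids. This is a minimal illustrative sketch, not the authors' implementation; the function name, shapes, and random data are all hypothetical.

```python
import numpy as np

def quantize_to_language_codebook(patch_embeddings, codebook):
    """Nearest-neighbor quantization of image patch embeddings against a
    frozen language codebook (e.g., a pretrained model's word embeddings).

    patch_embeddings: (num_patches, dim) array of encoder outputs.
    codebook:         (vocab_size, dim) frozen word-embedding table.
    Returns the selected token ids and their codebook embeddings
    (the sequence that would then be masked and fed to BERT).
    """
    # Squared L2 distance from every patch embedding to every codebook row.
    dists = ((patch_embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    token_ids = dists.argmin(axis=1)   # each patch -> closest "word" id
    quantized = codebook[token_ids]    # embeddings the decoder reconstructs from
    return token_ids, quantized

# Toy example with hypothetical sizes: 4 patches, 10-word vocab, dim 8.
rng = np.random.default_rng(0)
patches = rng.normal(size=(4, 8))
codebook = rng.normal(size=(10, 8))
ids, q = quantize_to_language_codebook(patches, codebook)
```

Because the codebook is the frozen vocabulary of a pretrained language model rather than a learned VQ-VAE codebook, similar images end up mapped to similar token sequences that the language model can already process.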


