PaLI: A Jointly-Scaled Multilingual Language-Image Model

09/14/2022
by   Xi Chen, et al.

Effective scaling and a flexible task interface enable large language models to excel at many tasks. PaLI (Pathways Language and Image model) extends this approach to the joint modeling of language and vision. PaLI generates text based on visual and textual inputs, and with this interface it performs many vision, language, and multimodal tasks in many languages. To train PaLI, we make use of large pretrained encoder-decoder language models and Vision Transformers (ViTs). This allows us to capitalize on their existing capabilities and make the most of the substantial cost of training them. We find that joint scaling of the vision and language components is important. Since existing Transformers for language are much larger than their vision counterparts, we train the largest ViT to date (ViT-e) to quantify the benefits from even larger-capacity vision models. To train PaLI, we create a large multilingual mix of pretraining tasks, based on a new image-text training set containing 10B images and texts in over 100 languages. PaLI achieves state-of-the-art results on multiple vision and language tasks (such as captioning, visual question answering, and scene-text understanding), while retaining a simple, modular, and scalable design.

research
08/22/2022

Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks

A big convergence of language, vision, and multimodal pretraining is eme...
research
07/13/2023

mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs

Modular vision-language models (Vision-LLMs) align pretrained image enco...
research
02/28/2023

Augmented Transformers with Adaptive n-grams Embedding for Multilingual Scene Text Recognition

While vision transformers have been highly successful in improving the p...
research
03/29/2023

MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks

The development of language models has moved from encoder-decoder to de...
research
02/02/2023

Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment

Recent progress in scaling up large language models has shown impressive...
research
12/26/2019

Vision and Language: from Visual Perception to Content Creation

Vision and language are two fundamental capabilities of human intelligen...
research
07/28/2023

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

We study how vision-language models trained on Internet-scale data can b...
