Language Is Not All You Need: Aligning Perception with Language Models

02/27/2023
by Shaohan Huang, et al.

A big convergence of language, multimodal perception, action, and world modeling is a key step toward artificial general intelligence. In this work, we introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot). Specifically, we train Kosmos-1 from scratch on web-scale multimodal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data. We evaluate various settings, including zero-shot, few-shot, and multimodal chain-of-thought prompting, on a wide range of tasks without any gradient updates or finetuning. Experimental results show that Kosmos-1 achieves impressive performance on (i) language understanding, generation, and even OCR-free NLP (directly fed with document images), (ii) perception-language tasks, including multimodal dialogue, image captioning, and visual question answering, and (iii) vision tasks, such as image recognition with descriptions (specifying classification via text instructions). We also show that MLLMs can benefit from cross-modal transfer, i.e., transferring knowledge from language to multimodal settings and from multimodal settings back to language. In addition, we introduce a Raven IQ test dataset that diagnoses the nonverbal reasoning capability of MLLMs.
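The abstract compresses a lot: "arbitrarily interleaved text and images" describes the input format, while zero-shot, few-shot, and multimodal chain-of-thought prompting are evaluation protocols that require no gradient updates. As a rough, hypothetical illustration (this is not the paper's code; every file name, type, and helper below is invented), here is a minimal Python sketch of how such interleaved prompts could be assembled before being fed to an MLLM:

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

@dataclass
class ImageRef:
    """Placeholder for an image embedded in the prompt stream (hypothetical)."""
    path: str

# An interleaved prompt is just an ordered sequence of text spans and images.
Segment = Union[str, ImageRef]

def few_shot_prompt(demos: List[Tuple[str, str, str]],
                    query_image: str, question: str) -> List[Segment]:
    """In-context (few-shot) prompt: worked demonstrations, then the query.
    No gradient updates; the model completes the final 'Answer:' span."""
    prompt: List[Segment] = []
    for img, q, a in demos:
        prompt += [ImageRef(img), f"Question: {q} Answer: {a}\n"]
    prompt += [ImageRef(query_image), f"Question: {question} Answer:"]
    return prompt

def cot_prompt(query_image: str, question: str) -> List[Segment]:
    """Multimodal chain-of-thought: elicit a rationale first; the rationale
    is later appended and the model re-prompted for the final answer."""
    return [ImageRef(query_image),
            f"Question: {question}\nLet's think step by step:"]

if __name__ == "__main__":
    demos = [("red_apple.jpg", "What color is the fruit?", "Red.")]
    for seg in few_shot_prompt(demos, "green_apple.jpg",
                               "What color is the fruit?"):
        print(seg)
```

In the multimodal chain-of-thought setting the abstract refers to, generation is two-stage: the model first produces a rationale conditioned on the image, and that rationale is appended to the prompt before the final answer is requested.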


Related research

Kosmos-2: Grounding Multimodal Large Language Models to the World (06/26/2023)
We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling...

Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities (08/24/2023)
We introduce the Qwen-VL series, a set of large-scale vision-language models...

Generative Pretraining in Multimodality (07/11/2023)
We present Emu, a Transformer-based multimodal foundation model, which c...

LMEye: An Interactive Perception Network for Large Language Models (05/05/2023)
Training a Large Visual Language Model (LVLM) from scratch, like GPT-4, ...

i-Code Studio: A Configurable and Composable Framework for Integrative AI (05/23/2023)
Artificial General Intelligence (AGI) requires comprehensive understanding...

ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst (05/25/2023)
Building general-purpose models that can perceive diverse real-world modalities...

Perception Test: A Diagnostic Benchmark for Multimodal Video Models (05/23/2023)
We propose a novel multimodal video benchmark - the Perception Test - to...
