Visual Instruction Tuning

04/17/2023
by Haotian Liu, et al.

Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and LLM for general-purpose visual and language understanding. Our early experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We make GPT-4 generated visual instruction tuning data, our model and code base publicly available.

research
04/06/2023

Instruction Tuning with GPT-4

Prior work has shown that finetuning large language models (LLMs) using ...
research
07/03/2023

SCITUNE: Aligning Large Language Models with Scientific Multimodal Instructions

Instruction finetuning is a popular paradigm to align large language mod...
research
05/11/2023

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

General-purpose language models that can solve various language-domain t...
research
08/08/2023

Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions

Multimodal Large Language Models (MLLMs) have recently sparked significa...
research
05/08/2023

MultiModal-GPT: A Vision and Language Model for Dialogue with Humans

We present a vision and language model named MultiModal-GPT to conduct m...
research
07/07/2023

GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest

Instruction tuning large language model (LLM) on image-text pairs has ac...
research
05/24/2023

Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models

Recently, growing interest has been aroused in extending the multimodal ...
