StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data

08/20/2023
by   Yanda Li, et al.
0

The remarkable multimodal capabilities demonstrated by OpenAI's GPT-4 have sparked significant interest in the development of multimodal Large Language Models (LLMs). A primary research objective of such models is to align visual and textual modalities effectively while comprehending human instructions. Current methodologies often rely on annotations derived from benchmark datasets to construct image-dialogue datasets for training purposes, akin to instruction tuning in LLMs. However, these datasets often exhibit domain bias, potentially constraining the generative capabilities of the models. In an effort to mitigate these limitations, we propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning. This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models to yield a diverse and controllable dataset with varied image content. This not only provides greater flexibility compared to existing methodologies but also significantly enhances several model capabilities. Our research includes comprehensive experiments conducted on various datasets using the open-source LLAVA model as a testbed for our proposed pipeline. Our results underscore marked enhancements across more than ten commonly assessed capabilities,

READ FULL TEXT

page 1

page 3

page 4

page 5

page 8

page 9

research
08/21/2023

Instruction Tuning for Large Language Models: A Survey

This paper surveys research works in the quickly advancing field of inst...
research
07/03/2023

SCITUNE: Aligning Large Language Models with Scientific Multimodal Instructions

Instruction finetuning is a popular paradigm to align large language mod...
research
08/08/2023

Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions

Multimodal Large Language Models (MLLMs) have recently sparked significa...
research
09/18/2023

An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models

Visual instruction tuning has recently shown encouraging progress with o...
research
08/22/2023

Towards an On-device Agent for Text Rewriting

Large Language Models (LLMs) have demonstrated impressive capabilities f...
research
05/07/2023

X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages

Large language models (LLMs) have demonstrated remarkable language abili...
research
05/24/2023

PIVOINE: Instruction Tuning for Open-world Information Extraction

We consider the problem of Open-world Information Extraction (Open-world...

Please sign up or login with your details

Forgot password? Click here to reset