LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding

06/29/2023
by Yanzhe Zhang, et al.

Instruction tuning unlocks the superior capability of Large Language Models (LLMs) to interact with humans. Furthermore, recent instruction-following datasets include images as visual inputs, collecting responses for image-based instructions. However, visual instruction-tuned models cannot comprehend textual details within images well. This work enhances the current visual instruction tuning pipeline with text-rich images (e.g., movie posters, book covers, etc.). Specifically, we first use publicly available OCR tools to collect results on 422K text-rich images from the LAION dataset. Moreover, we prompt text-only GPT-4 with recognized texts and image captions to generate 16K conversations, each containing question-answer pairs for text-rich images. By combining our collected data with previous multi-modal instruction-following data, our model, LLaVAR, substantially improves the LLaVA model's capability on text-based VQA datasets (up to a 20% accuracy improvement) while achieving an accuracy of 91.42% on ScienceQA. The GPT-4-based instruction-following evaluation also demonstrates the improvement of our model on both natural images and text-rich images. Through qualitative analysis, LLaVAR shows promising interaction skills (e.g., reasoning, writing, and elaboration) with humans based on the latest real-world online content that combines text and images. We make our code/data/models publicly available at https://llavar.github.io/.
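
The data-generation step described above, prompting text-only GPT-4 with OCR results and an image caption to produce question-answer conversations about a text-rich image, can be illustrated with a minimal sketch. The prompt wording, function names, and OCR token format below are assumptions made for illustration rather than the paper's released code, and the snippet assumes the 2023-era openai Python SDK (the openai.ChatCompletion API).

    # Hypothetical sketch: ask text-only GPT-4 to write a QA conversation
    # about a text-rich image, given only its caption and OCR tokens.
    import openai  # assumes openai<1.0 (ChatCompletion API)

    SYSTEM_PROMPT = (
        "You are given the caption of an image and the words recognized by an "
        "OCR system. Generate a multi-turn conversation of questions and "
        "answers about the textual content of the image, as if you could see it."
    )

    def generate_conversation(caption: str, ocr_words: list[str]) -> str:
        """Request QA pairs grounded in the caption plus recognized OCR text."""
        user_msg = f"Caption: {caption}\nOCR tokens: {', '.join(ocr_words)}"
        response = openai.ChatCompletion.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": user_msg},
            ],
            temperature=0.7,
        )
        return response["choices"][0]["message"]["content"]

    if __name__ == "__main__":
        # Illustrative inputs; in the paper's pipeline these would come from
        # LAION captions and OCR run over the 422K collected images.
        print(generate_conversation(
            caption="A movie poster for a science-fiction film",
            ocr_words=["COMING", "SOON", "JULY", "2023"],
        ))

Run at scale over the collected text-rich images, this kind of prompting yields the instruction-following conversations that are then mixed with existing multi-modal instruction data for tuning.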
