GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest

07/07/2023
by Shilong Zhang, et al.

Instruction tuning large language models (LLMs) on image-text pairs has achieved unprecedented vision-language multimodal abilities. However, their vision-language alignment is built only at the image level; the lack of region-level alignment limits progress toward fine-grained multimodal understanding. In this paper, we propose instruction tuning on regions of interest. The key design is to reformulate the bounding box as a spatial instruction. The interleaved sequence of visual features extracted by the spatial instruction and the language embeddings is fed to the LLM, which is trained on region-text data transformed into instruction-tuning format. Our region-level vision-language model, termed GPT4RoI, offers a brand-new conversational and interactive experience beyond image-level understanding. (1) Controllability: users can interact with the model through both language and spatial instructions to flexibly adjust the level of detail of a question. (2) Capacities: the model supports not only single-region but also multi-region spatial instructions, unlocking region-level multimodal capacities such as detailed region captioning and complex region reasoning. (3) Composition: any off-the-shelf object detector can serve as a spatial instruction provider, so informative object attributes such as color, shape, material, action, and relation to other objects can be mined from the model. The code, data, and demo can be found at https://github.com/jshilong/GPT4RoI.
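To make the spatial-instruction idea concrete, below is a minimal PyTorch sketch (not the released GPT4RoI implementation) of how a bounding box can be turned into a region feature via RoIAlign, projected into the LLM embedding space, and spliced into the text embedding sequence at a placeholder position. All module names, dimensions, and the placeholder convention here are illustrative assumptions.

import torch
import torch.nn as nn
from torchvision.ops import roi_align

class SpatialInstructionEncoder(nn.Module):
    # Encodes one bounding box ("spatial instruction") as a single LLM-space embedding.
    def __init__(self, vis_dim=1024, llm_dim=4096, pool=7):
        super().__init__()
        self.pool = pool
        self.proj = nn.Linear(vis_dim * pool * pool, llm_dim)  # hypothetical region-to-LLM projector

    def forward(self, feat_map, boxes, spatial_scale):
        # feat_map: (B, C, H, W) image features; boxes: (K, 5) = (batch_idx, x1, y1, x2, y2) in pixels
        roi = roi_align(feat_map, boxes, output_size=self.pool,
                        spatial_scale=spatial_scale, aligned=True)  # (K, C, pool, pool)
        return self.proj(roi.flatten(1))                            # (K, llm_dim)

def splice_region_embeddings(text_embeds, region_embeds, placeholder_mask):
    # Replace the embeddings of <region> placeholder tokens with the region embeddings.
    fused = text_embeds.clone()
    fused[placeholder_mask] = region_embeds.to(fused.dtype)
    return fused

# Toy usage: one image, one box, a 16-token prompt containing a single <region> placeholder.
encoder = SpatialInstructionEncoder(vis_dim=1024, llm_dim=4096)
feat_map = torch.randn(1, 1024, 24, 24)                      # e.g. a ViT/CLIP feature map
boxes = torch.tensor([[0.0, 32.0, 48.0, 200.0, 180.0]])      # batch index + xyxy box
region = encoder(feat_map, boxes, spatial_scale=24 / 336.0)  # image assumed to be 336x336
text_embeds = torch.randn(1, 16, 4096)                       # prompt token embeddings from the LLM
mask = torch.zeros(1, 16, dtype=torch.bool)
mask[0, 5] = True                                            # position of the <region> placeholder
inputs_embeds = splice_region_embeddings(text_embeds, region, mask)
print(inputs_embeds.shape)  # torch.Size([1, 16, 4096]); passed to the LLM as input embeddings

With multiple boxes, each placeholder position in the prompt would receive its own projected region feature, which is what allows multi-region instructions in a single query.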


Related research

- 04/17/2023: Visual Instruction Tuning
- 05/08/2023: MultiModal-GPT: A Vision and Language Model for Dialogue with Humans
- 07/04/2023: mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
- 06/01/2023: LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day
- 09/07/2023: ImageBind-LLM: Multi-modality Instruction Tuning
- 06/23/2023: MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
- 06/26/2023: Large Multimodal Models: Notes on CVPR 2023 Tutorial
