BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs

07/17/2023
by   fcq, et al.

LLMs have demonstrated remarkable abilities at interacting with humans through language, especially with the use of instruction-following data. Recent advancements in LLMs, such as MiniGPT-4, LLaVA, and X-LLM, further extend their abilities by incorporating multi-modal inputs, including image, video, and speech. Despite their effectiveness at generating precise and detailed language understanding of the given modality signal, these LLMs give up the ability to ground specific parts of the inputs, thus constructing only a coarse-grained mapping. However, explicit and informative correspondence between text and other modalities will not only improve the user experience but also help to expand the application scenarios of multi-modal LLMs. Therefore, we propose BuboGPT, a multi-modal LLM with visual grounding that can perform cross-modal interaction between vision, audio and language, providing fine-grained understanding of visual objects and other given modalities. As a result, BuboGPT is able to point out the specific location of an object in the image when generating a response or description for that object. Our contributions are two-fold: 1) An off-the-shelf visual grounding module based on SAM that extracts entities from a sentence and finds the corresponding masks in the image. 2) A two-stage training scheme and instruction dataset to endow the model with joint text-image-audio understanding. Our experiments show that BuboGPT achieves impressive multi-modality understanding and visual grounding abilities when interacting with humans. It performs consistently well when provided with arbitrary modality combinations (either aligned or unaligned). Our code, model and dataset are available at https://bubo-gpt.github.io .
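As a rough illustration of contribution (1), the sketch below shows one way to wire an entity extractor to a promptable segmentation model such as SAM: pull candidate noun phrases out of the generated response, then request a mask for each phrase. The use of spaCy and the `segment_phrase` wrapper are illustrative assumptions, not BuboGPT's actual modules.

```python
# Minimal sketch of an entity-grounding pipeline (assumptions noted below),
# not BuboGPT's actual implementation.
import spacy

# Assumption: spaCy's noun chunks stand in for the paper's entity extractor.
nlp = spacy.load("en_core_web_sm")


def extract_entities(response: str) -> list[str]:
    """Pull candidate visual entities (noun phrases) out of an LLM response."""
    doc = nlp(response)
    return [chunk.text for chunk in doc.noun_chunks]


def segment_phrase(image, phrase: str):
    """Hypothetical wrapper: run an open-vocabulary detector on `phrase`,
    then prompt SAM with the detected boxes to obtain a segmentation mask."""
    raise NotImplementedError("plug in a detector + SAM here")


def ground_response(image, response: str) -> dict:
    """Map each entity mentioned in the response to a mask in the image."""
    return {ent: segment_phrase(image, ent) for ent in extract_entities(response)}
```

A matching step between the extracted entities and the mask labels (contribution 1 in the paper) would sit on top of `ground_response`, filtering out entities that have no visual counterpart.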


