PandaGPT: One Model To Instruction-Follow Them All

05/25/2023
by   Yixuan Su, et al.

We present PandaGPT, an approach to emPower large lANguage moDels with visual and Auditory instruction-following capabilities. Our pilot experiments show that PandaGPT can perform complex tasks such as generating detailed image descriptions, writing stories inspired by videos, and answering questions about audio. More interestingly, PandaGPT can take multimodal inputs simultaneously and compose their semantics naturally. For example, PandaGPT can connect how objects look in an image or video with how they sound in audio. To do so, PandaGPT combines the multimodal encoders from ImageBind with the large language models from Vicuna. Notably, only aligned image-text pairs are required to train PandaGPT. Thanks to the strong capability of ImageBind in embedding data from different modalities into the same space, PandaGPT displays emergent, i.e., zero-shot, cross-modal behaviors for data beyond image and text (e.g., video, audio, depth, thermal, and IMU). We hope that PandaGPT serves as an initial step toward building AGI that can perceive and understand inputs in different modalities holistically, as we humans do. Our project page is at https://panda-gpt.github.io/.
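The core recipe described above, a frozen shared-space multimodal encoder feeding a language model, can be sketched as follows. This is an illustrative mock-up, not the paper's code: the module names, dimensions, and the single-soft-token design are assumptions standing in for the real ImageBind encoder and Vicuna model; the key idea it shows is that a small projection, trained only on image-text pairs, suffices because ImageBind already binds other modalities into the same embedding space.

```python
# Sketch of PandaGPT-style multimodal prompting (illustrative, not the
# authors' implementation). A frozen encoder (stand-in for ImageBind)
# maps any modality to a shared embedding space; a linear layer projects
# that embedding into the LLM's token-embedding space and is prepended
# to the text embeddings as a soft prefix token.
import torch
import torch.nn as nn

EMB_DIM = 1024   # shared multimodal embedding size (assumed)
LLM_DIM = 4096   # LLM hidden size, e.g. a 7B-scale model (assumed)

class MultimodalPrefix(nn.Module):
    """Project a modality embedding to one soft prefix token for the LLM."""
    def __init__(self, emb_dim: int = EMB_DIM, llm_dim: int = LLM_DIM):
        super().__init__()
        # Trained on aligned image-text pairs; other modalities transfer
        # zero-shot because the encoder embeds them in the same space.
        self.proj = nn.Linear(emb_dim, llm_dim)

    def forward(self, modality_emb: torch.Tensor) -> torch.Tensor:
        # (batch, emb_dim) -> (batch, 1, llm_dim)
        return self.proj(modality_emb).unsqueeze(1)

# Usage: prepend the projected prefix to the text-token embeddings,
# then feed the combined sequence to the language model.
prefix_layer = MultimodalPrefix()
image_emb = torch.randn(2, EMB_DIM)        # stand-in encoder output
text_embs = torch.randn(2, 16, LLM_DIM)    # stand-in token embeddings
llm_input = torch.cat([prefix_layer(image_emb), text_embs], dim=1)
print(tuple(llm_input.shape))  # (2, 17, 4096)
```

Because only the projection (and, per the paper, lightweight tuning of the language model) is learned, feeding an audio or depth embedding through the same path at inference requires no additional training.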

