ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst

05/25/2023
by Zijia Zhao, et al.

Building general-purpose models that can perceive diverse real-world modalities and solve various tasks is an appealing target in artificial intelligence. In this paper, we present ChatBridge, a novel multimodal language model that leverages the expressive capabilities of language as the catalyst to bridge the gap between various modalities. We show that language-paired two-modality data alone is sufficient to connect all modalities. ChatBridge builds on recent large language models (LLMs) and extends their zero-shot capabilities to incorporate diverse multimodal inputs. ChatBridge is trained in two stages. The first stage aligns each modality with language, which brings emergent multimodal correlation and collaboration abilities. The second stage instruction-finetunes ChatBridge to align it with user intent on our newly proposed multimodal instruction-tuning dataset, MULTIS, which covers a wide range of 16 multimodal tasks spanning the text, image, video, and audio modalities. We show strong quantitative and qualitative results on zero-shot multimodal tasks covering text, image, video, and audio. All code, data, and models of ChatBridge will be open-sourced.
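The two-stage recipe described in the abstract suggests a simple mental model: frozen per-modality encoders feed small trainable projectors that map their features into the language model's embedding space, with the projectors trained on language-paired data in stage one and the whole pipeline instruction-tuned in stage two. The sketch below illustrates that idea only; it is not the authors' implementation, and all module names, dimensions, and the placeholder Transformer standing in for the LLM are hypothetical.

```python
# Minimal sketch (not the ChatBridge code) of aligning modalities with a
# language model via per-modality projectors. Dimensions and names are made up.
import torch
import torch.nn as nn


class ModalityProjector(nn.Module):
    """Maps frozen-encoder features into the LLM embedding space."""
    def __init__(self, enc_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(enc_dim, llm_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.proj(feats)


class BridgeModel(nn.Module):
    """Wires per-modality projectors into a (placeholder) language model."""
    def __init__(self, llm_dim: int = 512):
        super().__init__()
        self.projectors = nn.ModuleDict({
            "image": ModalityProjector(768, llm_dim),
            "video": ModalityProjector(768, llm_dim),
            "audio": ModalityProjector(512, llm_dim),
        })
        # Stand-in for a pretrained LLM; a real system would load one here.
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, modality: str, feats: torch.Tensor,
                text_embeds: torch.Tensor) -> torch.Tensor:
        tokens = self.projectors[modality](feats)           # modality -> LLM space
        sequence = torch.cat([tokens, text_embeds], dim=1)  # prepend to the text
        return self.llm(sequence)


# Stage 1: optimize only the projectors on language-paired two-modality data,
# keeping the language model frozen. Stage 2: instruction-tune on multimodal
# instruction data (e.g. a dataset like MULTIS) with the same interface.
model = BridgeModel()
for p in model.llm.parameters():
    p.requires_grad = False  # stage-1 setting: frozen language model

image_feats = torch.randn(2, 16, 768)  # dummy output of a frozen image encoder
text_embeds = torch.randn(2, 8, 512)   # dummy text token embeddings
out = model("image", image_feats, text_embeds)
print(out.shape)  # (2, 24, 512)
```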


