LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark

06/11/2023
by Zhenfei Yin, et al.

Large language models (LLMs) have become a potential pathway toward achieving artificial general intelligence. Recent work on multi-modal large language models (MLLMs) has demonstrated their effectiveness in handling visual modalities. In this work, we extend the research on MLLMs to point clouds and present the LAMM-Dataset and LAMM-Benchmark for 2D image and 3D point cloud understanding. We also establish an extensible framework to facilitate the extension of MLLMs to additional modalities. Our main contribution is three-fold: 1) We present the LAMM-Dataset and LAMM-Benchmark, which cover almost all high-level vision tasks for 2D and 3D vision; extensive experiments validate their effectiveness. 2) We detail the methods for constructing instruction-tuning datasets and benchmarks for MLLMs, enabling future research on MLLMs to scale up and extend to other domains, tasks, and modalities faster. 3) We provide a preliminary but extensible MLLM training framework optimized for modality extension, along with baseline models, comprehensive experimental observations, and analysis to accelerate future research.


