mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

04/27/2023
by Qinghao Ye, et al.

Large language models (LLMs) have demonstrated impressive zero-shot abilities on a variety of open-ended tasks, while recent research has also explored the use of LLMs for multi-modal generation. In this study, we introduce mPLUG-Owl, a novel training paradigm that equips LLMs with multi-modal abilities through modularized learning of a foundation LLM, a visual knowledge module, and a visual abstractor module. This approach can support multiple modalities and facilitate diverse unimodal and multimodal abilities through modality collaboration. The training paradigm of mPLUG-Owl involves a two-stage method for aligning image and text, which learns visual knowledge with the assistance of the LLM while maintaining and even improving its generation abilities. In the first stage, the visual knowledge module and the abstractor module are trained with a frozen LLM module to align image and text. In the second stage, language-only and multi-modal supervised datasets are used to jointly fine-tune a low-rank adaptation (LoRA) module on the LLM and the abstractor module, with the visual knowledge module frozen. We carefully construct a visually related instruction evaluation set, OwlEval. Experimental results show that our model outperforms existing multi-modal models, demonstrating mPLUG-Owl's impressive instruction-following and visual understanding ability, multi-turn conversation ability, and knowledge reasoning ability. In addition, we observe some unexpected and exciting abilities, such as multi-image correlation and scene text understanding, which make it possible to apply the model to harder real-world scenarios such as vision-only document comprehension. Our code, pre-trained model, instruction-tuned models, and evaluation set are available at https://github.com/X-PLUG/mPLUG-Owl. The online demo is available at https://www.modelscope.cn/studios/damo/mPLUG-Owl.
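The two-stage freezing scheme described in the abstract can be illustrated with a minimal sketch, not the official implementation: module names such as visual_encoder, visual_abstractor, llm, and lora_layers are illustrative placeholders, and the code only shows which parameter groups are trainable in each stage.

```python
# Minimal sketch (assumed module names, not the official mPLUG-Owl code)
# of the two-stage parameter-freezing scheme described in the abstract.
import torch.nn as nn


def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze every parameter of a module."""
    for p in module.parameters():
        p.requires_grad = trainable


def configure_stage(stage: int,
                    visual_encoder: nn.Module,
                    visual_abstractor: nn.Module,
                    llm: nn.Module,
                    lora_layers: nn.Module) -> None:
    if stage == 1:
        # Stage 1: pre-train the visual knowledge module and the
        # abstractor to align image and text while the LLM stays frozen.
        set_trainable(visual_encoder, True)
        set_trainable(visual_abstractor, True)
        set_trainable(llm, False)
        set_trainable(lora_layers, False)
    elif stage == 2:
        # Stage 2: freeze the visual knowledge module and jointly tune
        # the LoRA adapters on the LLM plus the abstractor, using mixed
        # language-only and multi-modal instruction data.
        set_trainable(visual_encoder, False)
        set_trainable(visual_abstractor, True)
        set_trainable(llm, False)          # base LLM weights stay frozen
        set_trainable(lora_layers, True)   # only low-rank adapters update
    else:
        raise ValueError(f"unknown stage: {stage}")
```

In practice the LoRA adapters would be injected into the LLM's attention layers (e.g., via a library such as PEFT), but the freezing pattern per stage is the point of this sketch.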


research
07/04/2023

mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

Document understanding refers to automatically extract, analyze and comp...
research
07/17/2023

BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs

LLMs have demonstrated remarkable abilities at interacting with humans t...
research
06/12/2023

Valley: Video Assistant with Large Language model Enhanced abilitY

Recently, several multi-modal models have been developed for joint image...
research
09/05/2023

AGIBench: A Multi-granularity, Multimodal, Human-referenced, Auto-scoring Benchmark for Large Language Models

Large language models (LLMs) like ChatGPT have revealed amazing intellig...
research
05/04/2023

Few-shot Domain-Adaptive Visually-fused Event Detection from Text

Incorporating auxiliary modalities such as images into event detection m...
research
05/02/2023

SIA-FTP: A Spoken Instruction Aware Flight Trajectory Prediction Framework

Ground-air negotiation via speech communication is a vital prerequisite ...
research
04/26/2021

MDETR – Modulated Detection for End-to-End Multi-Modal Understanding

Multi-modal reasoning systems rely on a pre-trained object detector to e...
