mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

07/04/2023
by Jiabo Ye, et al.

Document understanding refers to automatically extracting, analyzing, and comprehending information from various types of digital documents, such as a web page. Existing Multimodal Large Language Models (MLLMs), including mPLUG-Owl, have demonstrated promising zero-shot capabilities in shallow OCR-free text recognition, indicating their potential for OCR-free document understanding. Nevertheless, without in-domain training, these models tend to ignore fine-grained OCR features, such as sophisticated tables or large blocks of text, which are essential for OCR-free document understanding. In this paper, we propose mPLUG-DocOwl, based on mPLUG-Owl, for OCR-free document understanding. Specifically, we first construct an instruction tuning dataset featuring a wide range of visual-text understanding tasks. Then, we strengthen the OCR-free document understanding ability by jointly training the model on language-only, general vision-and-language, and document instruction tuning datasets with our unified instruction tuning strategy. We also build an OCR-free document instruction understanding evaluation set, LLMDoc, to better compare models' capabilities in instruction compliance and document understanding. Experimental results show that our model outperforms existing multimodal models, demonstrating its strong document understanding ability. Moreover, without specific fine-tuning, mPLUG-DocOwl generalizes well to various downstream tasks. Our code, models, training data, and evaluation set are available at https://github.com/X-PLUG/mPLUG-DocOwl.
