UniDoc: A Universal Large Multimodal Model for Simultaneous Text Detection, Recognition, Spotting and Understanding

08/19/2023
by   Hao Feng, et al.
0

In the era of Large Language Models (LLMs), tremendous strides have been made in the field of multimodal understanding. However, existing advanced algorithms are limited to effectively utilizing the immense representation capabilities and rich world knowledge inherent to these large pre-trained models, and the beneficial connections among tasks within the context of text-rich scenarios have not been sufficiently explored. In this work, we introduce UniDoc, a novel multimodal model equipped with text detection and recognition capabilities, which are deficient in existing approaches. Moreover, UniDoc capitalizes on the beneficial interactions among tasks to enhance the performance of each individual task. To implement UniDoc, we perform unified multimodal instruct tuning on the contributed large-scale instruction following datasets. Quantitative and qualitative experimental results show that UniDoc sets state-of-the-art scores across multiple challenging benchmarks. To the best of our knowledge, this is the first large multimodal model capable of simultaneous text detection, recognition, spotting, and understanding.

READ FULL TEXT

page 2

page 4

page 5

page 7

page 10

page 11

page 12

research
09/20/2023

Kosmos-2.5: A Multimodal Literate Model

We present Kosmos-2.5, a multimodal literate model for machine reading o...
research
03/27/2023

TextMI: Textualize Multimodal Information for Integrating Non-verbal Cues in Pre-trained Language Models

Pre-trained large language models have recently achieved ground-breaking...
research
05/15/2020

Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models

Recent Transformer-based large-scale pre-trained models have revolutioni...
research
06/01/2023

Can Large Pre-trained Models Help Vision Models on Perception Tasks?

The recent upsurge in pre-trained large models (e.g. GPT-4) has swept ac...
research
10/09/2019

Exploring Hate Speech Detection in Multimodal Publications

In this work we target the problem of hate speech detection in multimoda...
research
06/22/2023

Generative Multimodal Entity Linking

Multimodal Entity Linking (MEL) is the task of mapping mentions with mul...
research
08/17/2022

Multimodal Lecture Presentations Dataset: Understanding Multimodality in Educational Slides

Lecture slide presentations, a sequence of pages that contain text and f...

Please sign up or login with your details

Forgot password? Click here to reset