On the Hidden Mystery of OCR in Large Multimodal Models

05/13/2023
by Yuliang Liu, et al.

Large models have recently played a dominant role in natural language processing and multimodal vision-language learning, but their efficacy in text-related visual tasks remains less explored. We conducted a comprehensive study of existing publicly available multimodal models, evaluating their performance in text recognition, text-based visual question answering, and key information extraction. Our findings reveal strengths and weaknesses in these models: they rely primarily on semantic understanding for word recognition and perceive individual character shapes poorly. They are also indifferent to text length and have limited ability to detect fine-grained features in images. These results demonstrate that even the most powerful current large multimodal models cannot match domain-specific methods in traditional text tasks, and that they face greater challenges in more complex tasks. Most importantly, the baseline results presented in this study could provide a foundation for conceiving and assessing new strategies aimed at enhancing zero-shot multimodal techniques. The evaluation pipeline will be available at https://github.com/Yuliang-Liu/MultimodalOCR.

research
06/15/2023

LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models

Large Vision-Language Models (LVLMs) have recently played a dominant rol...
research
05/23/2023

Perception Test: A Diagnostic Benchmark for Multimodal Video Models

We propose a novel multimodal video benchmark - the Perception Test - to...
research
08/07/2023

Tiny LVLM-eHub: Early Multimodal Experiments with Bard

Recent advancements in Large Vision-Language Models (LVLMs) have demonst...
research
11/27/2022

Understanding BLOOM: An empirical study on diverse NLP tasks

In this work, we present an evaluation of smaller BLOOM model variants (...
research
04/01/2022

Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

Large foundation models can exhibit unique capabilities depending on the...
research
05/23/2023

i-Code Studio: A Configurable and Composable Framework for Integrative AI

Artificial General Intelligence (AGI) requires comprehensive understandi...
research
04/23/2023

Evaluating ChatGPT's Information Extraction Capabilities: An Assessment of Performance, Explainability, Calibration, and Faithfulness

The capability of Large Language Models (LLMs) like ChatGPT to comprehen...
