LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models

06/15/2023
by   Peng Xu, et al.

Large Vision-Language Models (LVLMs) have recently played a dominant role in multimodal vision-language learning. Despite their great success, the field still lacks a holistic evaluation of their efficacy. This paper presents a comprehensive evaluation of publicly available large multimodal models by building an LVLM evaluation Hub (LVLM-eHub). Our LVLM-eHub consists of 8 representative LVLMs, such as InstructBLIP and MiniGPT-4, which are thoroughly evaluated through a quantitative capability evaluation and an online arena platform. The former evaluates 6 categories of multimodal capability, such as visual question answering and embodied artificial intelligence, on 47 standard text-related visual benchmarks, while the latter provides user-level evaluation of LVLMs in an open-world question-answering scenario. The study reveals several innovative findings. First, instruction-tuned LVLMs trained with massive in-domain data, such as InstructBLIP, heavily overfit many existing tasks and generalize poorly in the open-world scenario. Second, instruction-tuned LVLMs with moderate instruction-following data may suffer from object hallucination, i.e., they generate descriptions containing objects that are inconsistent with the target images. This either renders current evaluation metrics such as CIDEr for image captioning ineffective or leads to wrong answers. Third, employing a multi-turn reasoning evaluation framework can mitigate object hallucination, shedding light on how to develop an effective pipeline for LVLM evaluation. These findings provide a foundational framework for the conception and assessment of innovative strategies aimed at enhancing zero-shot multimodal techniques. Our LVLM-eHub will be available at https://github.com/OpenGVLab/Multi-Modality-Arena
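The arena platform's user-level ranking is most naturally aggregated with a pairwise rating scheme over anonymized head-to-head votes. Below is a minimal sketch assuming a Chatbot-Arena-style Elo update; the K-factor, base rating, and the `record_vote` helper are illustrative assumptions for exposition, not details taken from the paper.

```python
import random
from collections import defaultdict

# Sketch of an Elo-style rating update for a model arena, assuming the
# platform collects pairwise user votes between two anonymized LVLMs
# answering the same open-world question. Constants are illustrative.
K = 32          # update step size per vote (assumed)
BASE = 1000.0   # initial rating for every model (assumed)

ratings = defaultdict(lambda: BASE)

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def record_vote(model_a: str, model_b: str, outcome: float) -> None:
    """Update both ratings after one user vote.

    outcome: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    """
    e_a = expected_score(ratings[model_a], ratings[model_b])
    ratings[model_a] += K * (outcome - e_a)
    ratings[model_b] += K * ((1.0 - outcome) - (1.0 - e_a))

# Usage: simulate a handful of votes between two of the evaluated models.
for _ in range(100):
    record_vote("InstructBLIP", "MiniGPT-4",
                outcome=random.choice([0.0, 0.5, 1.0]))

for model, r in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {r:.1f}")
```

A pairwise scheme like this sidesteps the metric-gaming issue the paper raises for CIDEr: instead of scoring generated text against fixed references, models are ranked directly by human preference on open-world questions.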



