TouchStone: Evaluating Vision-Language Models by Language Models

08/31/2023
by   Shuai Bai, et al.
Large vision-language models (LVLMs) have recently witnessed rapid advancements, exhibiting a remarkable capacity for perceiving, understanding, and processing visual information by connecting visual receptors with large language models (LLMs). However, current assessments mainly focus on recognition and reasoning abilities, lacking direct evaluation of conversational skills and neglecting visual storytelling abilities. In this paper, we propose an evaluation method that uses strong LLMs as judges to comprehensively evaluate the various abilities of LVLMs. Firstly, we construct TouchStone, a comprehensive visual dialogue dataset consisting of open-world images and questions, covering five major categories of abilities and 27 subtasks. This dataset not only covers fundamental recognition and comprehension but also extends to literary creation. Secondly, by integrating detailed image annotations, we effectively transform the multimodal input content into a form understandable by LLMs. This enables us to employ advanced LLMs to directly evaluate the quality of multimodal dialogue without requiring human intervention. Through validation, we demonstrate that powerful LLMs, such as GPT-4, can effectively score dialogue quality by leveraging their textual capabilities alone, aligning with human preferences. We hope our work can serve as a touchstone for LVLMs' evaluation and pave the way for building stronger LVLMs. The evaluation code is available at https://github.com/OFA-Sys/TouchStone.
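The core idea of the evaluation pipeline, substituting the image with its detailed human annotation so a text-only LLM can act as judge, can be sketched as follows. This is a minimal illustration, not the paper's exact prompt template: the wording, the 1-10 scale, and the function name `build_judge_prompt` are assumptions for demonstration.

```python
def build_judge_prompt(annotation: str, question: str,
                       answer_a: str, answer_b: str) -> str:
    """Replace the image with its fine-grained text annotation so that a
    text-only LLM judge (e.g. GPT-4) can compare two LVLM answers.

    Note: this prompt wording is illustrative; the actual TouchStone
    template may differ.
    """
    return (
        "You are a fair judge of multimodal dialogue quality.\n"
        "The image is described by the following human annotation:\n"
        f"[Image annotation] {annotation}\n\n"
        f"[Question] {question}\n"
        f"[Assistant A] {answer_a}\n"
        f"[Assistant B] {answer_b}\n\n"
        "Rate the helpfulness and accuracy of each assistant on a 1-10 "
        "scale and output the result as 'A: x, B: y'."
    )

prompt = build_judge_prompt(
    annotation="A golden retriever catching a red frisbee in a park.",
    question="What is the dog doing?",
    answer_a="The dog is leaping to catch a frisbee.",
    answer_b="The dog is sleeping indoors.",
)
# The prompt would then be sent to a strong text-only LLM through any
# chat-completion API; the returned scores rank the competing LVLMs.
```

Because the judge sees only text, no human annotator is needed at scoring time; the one-time cost is the detailed annotation attached to each image in the dataset.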

