VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use

08/12/2023
by Yonatan Bitton, et al.

We introduce VisIT-Bench (Visual InsTruction Benchmark), a benchmark for evaluating instruction-following vision-language models for real-world use. Our starting point is curating 70 'instruction families' that we envision instruction-tuned vision-language models should be able to address. Extending beyond evaluations like VQAv2 and COCO, tasks range from basic recognition to game playing and creative generation. Following curation, our dataset comprises 592 test queries, each with a human-authored instruction-conditioned caption. These descriptions surface instruction-specific factors, e.g., for an instruction asking about the accessibility of a storefront for wheelchair users, the instruction-conditioned caption describes ramps/potential obstacles. These descriptions enable 1) collecting human-verified reference outputs for each instance; and 2) automatic evaluation of candidate multimodal generations using a text-only LLM, aligning with human judgment. We quantify quality gaps between models and references using both human and automatic evaluations; e.g., the top-performing instruction-following model wins against the GPT-4 reference in just 27% of comparisons. VisIT-Bench is dynamic: to participate, practitioners simply submit their model's responses on the project website. Data, code, and the leaderboard are available at visit-bench.github.io.
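
Because the instruction-conditioned captions stand in for the image, the automatic evaluation can be run with a text-only LLM judge. Below is a minimal sketch of such a pairwise judge, assuming the OpenAI Python client; the prompt wording, model choice, and function name are illustrative assumptions, not the paper's exact protocol.

```python
# Minimal sketch of a VisIT-Bench-style pairwise evaluation with a
# text-only LLM judge. The prompt wording, model name, and helper name
# are illustrative assumptions, not the paper's exact protocol.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def judge_pair(instruction: str, caption: str,
               response_a: str, response_b: str) -> str:
    """Ask a text-only LLM which candidate response better follows the
    instruction, using the human-authored instruction-conditioned caption
    in place of the image. Returns the judge's verdict ('A' or 'B')."""
    prompt = (
        "An image is described as follows:\n"
        f"{caption}\n\n"
        f"Instruction: {instruction}\n\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n\n"
        "Which response follows the instruction better, given the image "
        "description? Answer with exactly one letter: 'A' or 'B'."
    )
    out = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic judging
    )
    return out.choices[0].message.content.strip()
```

Running this judge over many (model output, reference) pairs yields the win rates reported above, e.g., how often a candidate model beats the GPT-4 reference.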

Related research

05/04/2023 · Panda LLM: Training Data and Evaluation for Open-Sourced Chinese Instruction-Following Large Language Models
04/06/2023 · Instruction Tuning with GPT-4
08/16/2023 · Time Travel in LLMs: Tracing Data Contamination in Large Language Models
06/15/2023 · LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models
06/08/2023 · PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization
05/18/2023 · LLMScore: Unveiling the Power of Large Language Models in Text-to-Image Synthesis Evaluation
07/17/2023 · AlpaGasus: Training A Better Alpaca with Fewer Data
