Toward Verifiable and Reproducible Human Evaluation for Text-to-Image Generation

04/04/2023
by   Mayu Otani, et al.

Human evaluation is critical for validating the performance of text-to-image generative models, as this highly cognitive process requires deep comprehension of text and images. However, our survey of 37 recent papers reveals that many works rely solely on automatic measures (e.g., FID) or perform poorly described human evaluations that are neither reliable nor repeatable. This paper proposes a standardized and well-defined human evaluation protocol to facilitate verifiable and reproducible human evaluation in future works. In our pilot data collection, we experimentally show that current automatic measures are incompatible with human perception when assessing text-to-image generation results. Furthermore, we provide insights for designing human evaluation experiments that yield reliable and conclusive results. Finally, we make several resources publicly available to the community to facilitate easy and fast implementation of the protocol.
