Human Evaluation of Text-to-Image Models on a Multi-Task Benchmark

11/22/2022
by Vitali Petsiuk, et al.

We provide a new multi-task benchmark for evaluating text-to-image models and perform a human evaluation comparing the most common open-source (Stable Diffusion) and commercial (DALL-E 2) models. Twenty computer science AI graduate students evaluated the two models on three tasks, at three difficulty levels, across ten prompts each, providing 3,600 ratings in total. Text-to-image generation has progressed rapidly, to the point that many recent models can create realistic, high-resolution images for a wide range of prompts. However, current text-to-image methods, and the broader body of research in vision-language understanding, still struggle with intricate prompts that contain many objects with multiple attributes and relationships. We introduce a new text-to-image benchmark comprising a suite of thirty-two tasks over multiple applications that capture a model's ability to handle different features of a text prompt: for example, asking a model to generate a varying number of the same object to measure its ability to count, or providing a prompt with several objects that each have a different attribute to test whether it matches objects to attributes correctly. Rather than subjectively evaluating text-to-image results on an arbitrary set of prompts, our benchmark consists of challenge tasks at three difficulty levels (easy, medium, and hard), with human ratings for each generated image.
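As a quick sanity check on the reported numbers, the minimal sketch below reproduces the rating count implied by the evaluation setup, assuming each of the twenty raters scored every image generated by both models exactly once; the variable names are illustrative, not taken from the paper.

# Minimal sketch of the rating-count arithmetic implied by the abstract,
# assuming each rater scores every generated image from both models once.
num_raters = 20    # CS/AI graduate student evaluators
num_models = 2     # Stable Diffusion and DALL-E 2
num_tasks = 3      # tasks covered in the human evaluation
num_levels = 3     # difficulty levels: easy, medium, hard
num_prompts = 10   # prompts per task and difficulty level

images_per_rater = num_models * num_tasks * num_levels * num_prompts  # 180 images
total_ratings = num_raters * images_per_rater                         # 3,600 ratings

print(images_per_rater, total_ratings)

Under these assumptions, each student rates 180 images and the study collects 3,600 ratings, matching the figure stated above.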

