DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers

02/08/2022
by Jaemin Cho, et al.

Generating images from textual descriptions has attracted a great deal of attention. Recently, DALL-E, a multimodal transformer language model, and its variants have shown high-quality text-to-image generation capabilities with a simple architecture and training objective, powered by large-scale training data and computation. However, despite the interesting image generation results, there has been no detailed analysis of how to evaluate such models. In this work, we investigate the reasoning capabilities and social biases of such text-to-image generative transformers in detail. First, we measure four visual reasoning skills: object recognition, object counting, color recognition, and spatial relation understanding. For this, we propose PaintSkills, a diagnostic dataset and evaluation toolkit that measures these four visual reasoning skills. Second, we measure the text alignment and quality of the generated images with pretrained image captioning, image-text retrieval, and image classification models. Third, we assess social biases in the models. For this, we propose an evaluation of gender and racial biases in text-to-image generation models based on a pretrained image-text retrieval model and human evaluation. Our experiments show that recent text-to-image models are better at recognizing and counting objects than at recognizing colors and understanding spatial relations, and that a large gap remains between model performance and oracle accuracy on all skills. Next, we demonstrate that recent text-to-image models learn specific gender/racial biases from web image-text pairs. We also show that our automatic evaluations of visual reasoning skills and gender bias are highly correlated with human judgments. We hope our work will help guide future progress in improving text-to-image models on visual reasoning skills and social biases. Code and data at: https://github.com/j-min/DallEval
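The text-alignment and bias evaluations described above both rely on a pretrained image-text retrieval model to score how well a generated image matches candidate captions. The sketch below illustrates that general idea with CLIP via the Hugging Face `transformers` library; it is a minimal illustration and not the paper's evaluation code (see the linked repository for that). The checkpoint name, image path, and prompts are assumptions made for the example.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Pretrained image-text retrieval model (checkpoint is an assumption for
# this sketch, not necessarily the exact model used in the paper).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def alignment_scores(image_path, candidate_texts):
    """Cosine similarity between one generated image and each candidate text."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=candidate_texts, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # image_embeds: (1, d), text_embeds: (num_texts, d); both are L2-normalized,
    # so their dot product is the cosine similarity.
    return (out.image_embeds @ out.text_embeds.T).squeeze(0).tolist()

# Text-alignment check: does the image match its prompt better than a distractor?
scores = alignment_scores(
    "generated.png",                                  # hypothetical T2I output
    ["a red cube to the left of a blue sphere",       # original prompt
     "a blue cube to the left of a red sphere"])      # distractor prompt
print(scores)
```

The same scoring could be repurposed for a bias probe by supplying attribute phrases (e.g., gendered or race-related captions) as the candidate texts and comparing which one the generated image is retrieved against most strongly.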


